We present a novel framework, Graph-Based Routing for Mixture of Vision-Language-Action Experts,
designed for long-horizon robotic manipulation tasks. Our approach employs a Mixture of Experts
(MoE) architecture that integrates a graph-based router with dedicated vision and language modules
and a set of action experts, enabling robust and efficient task planning and execution. The graph
planner functions as a high-level router that interprets visual observations and language instructions
to dynamically select the appropriate action expert for the current task. Specifically, our framework
leverages a vision model with an MLP head to extract key environmental features from observations. The
language model is fine-tuned from a pre-trained backbone to improve instruction-to-task matching accuracy,
ensuring reliable and robust task recognition. The action experts are built on the Action Chunking
with Transformers (ACT) architecture, modified to accommodate the vision and language modalities. The
graph router is central to the framework: it coordinates the strengths of the vision, language,
and action experts, yielding a system that is both adaptable and computationally efficient.
Overall, the modular design allows components to be integrated or replaced flexibly, providing a
scalable solution for robotic manipulation tasks.
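The high-level routing loop described above can be sketched as a small graph traversal: the language model maps an instruction to a task node, and the router returns the action expert assigned to that node along with its successor sub-tasks. This is a minimal illustrative sketch; all class, method, and expert names (`TaskNode`, `GraphRouter`, `route`, `act_grasp`, etc.) are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class TaskNode:
    """One sub-task in the task graph (names are illustrative)."""
    name: str                        # sub-task label, e.g. "grasp"
    expert: str                      # action expert assigned to this sub-task
    successors: list = field(default_factory=list)  # next sub-task names

class GraphRouter:
    """Selects the action expert for the current sub-task and exposes
    the successor nodes so execution can advance along the graph."""
    def __init__(self, nodes):
        self.nodes = {n.name: n for n in nodes}

    def route(self, recognized_task):
        # In the full system, `recognized_task` would come from the
        # fine-tuned language model; here it is simply a string key.
        node = self.nodes[recognized_task]
        return node.expert, node.successors

# Toy two-step pick-and-place graph.
grasp = TaskNode("grasp", expert="act_grasp", successors=["place"])
place = TaskNode("place", expert="act_place")
router = GraphRouter([grasp, place])

expert, next_tasks = router.route("grasp")
# expert -> "act_grasp"; next_tasks -> ["place"]
```

In the actual framework, the routing decision would additionally condition on the vision model's features, and each expert would be an ACT-based policy rather than a string label; the sketch only shows the graph-lookup control flow.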