VLAG: Graph-Based Planning for Vision-Language-Action Models

Ardalan Aryashad, Yan Jin
USC IMPACT Laboratory
IDETC/CIE 2025

Abstract

We present a novel framework, Graph-Based Routing for Mixture of Vision-Language-Action Experts, designed for long-horizon robotic manipulation tasks. Our approach employs a Mixture of Experts (MoE) architecture that integrates a graph-based router with dedicated vision and language modules and a set of action expert modules, enabling robust and efficient task planning and execution. The graph planner functions as a high-level router that interprets visual observations and language instructions to dynamically select the appropriate action expert for the current task. Specifically, our framework leverages a vision model with an MLP head to extract key environmental features from observations. The language model is fine-tuned from a pre-trained model to enhance instruction-to-task pairing accuracy, ensuring reliable and robust task recognition. The action experts are built on the Action Chunking with Transformers (ACT) architecture, modified to accommodate the vision and language modalities. The graph router is crucial to the framework's functionality: it coordinates the strengths of the vision, language, and action experts, yielding a system that is both adaptable and computationally efficient. Overall, the modular design enables flexible integration of its components, providing a scalable solution for robotic manipulation tasks.

System Overview

VLAG is a modular Mixture of Experts framework in which a graph-based router interprets visual observations and language instructions, maps the current environmental state to graph nodes, and selects the appropriate action expert for each sub-task along a long-horizon sequence. The router encodes state transitions as directed edges. The vision module (CLIP + MLP) extracts key element states, the language module (fine-tuned SBERT) aligns instructions to tasks, and ACT-based action experts execute specialized control policies. Together, this design improves task selection, planning flexibility, and real-time action frequency.
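The routing logic described above can be sketched as a directed graph whose edges (current state, target sub-task) are labeled with the action expert to invoke. The following is a minimal illustrative sketch, not the paper's implementation: the state names, task names, and expert identifiers are invented for the example, and in the real system the state would come from the vision module and the task from the language module.

```python
class GraphRouter:
    """Toy graph-based router: maps (environment state, sub-task) pairs,
    stored as directed edges, to the action expert that should execute."""

    def __init__(self):
        # Directed edges: (current_state, target_task) -> action expert id.
        # These entries are illustrative placeholders only.
        self.edges = {
            ("drawer_closed", "open_drawer"): "expert_drawer",
            ("drawer_open", "place_block"): "expert_pick_place",
            ("block_placed", "close_drawer"): "expert_drawer",
        }

    def route(self, state, task):
        # `state` would be predicted by the vision module (CLIP + MLP);
        # `task` would be resolved by the language module (fine-tuned SBERT).
        expert = self.edges.get((state, task))
        if expert is None:
            raise KeyError(f"no edge from {state!r} for task {task!r}")
        return expert


router = GraphRouter()
print(router.route("drawer_closed", "open_drawer"))  # expert_drawer
```

Because experts hang off edges rather than being queried jointly, only the selected ACT policy runs at each step, which is what keeps the MoE design computationally efficient at control frequency.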

Results

On CALVIN D→D, the vision module, trained for 2k epochs with a CLIP (ViT-B/32) encoder and MLP heads, predicts the four continuous key elements with under 5% average error and exceeds 98% accuracy on Boolean states, enabling reliable graph routing from visual cues. For language, we built ~10k instruction-task pairs across 34 tasks, evaluated CLIP and SBERT baselines, and fine-tuned paraphrase-MiniLM-L6-v2 with a cosine similarity loss for 4 epochs, improving instruction-to-task accuracy from 63.5% to 98.4% while removing sensitivity to underscore formatting.
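The instruction-to-task step reduces to matching an embedded instruction against embedded task names by cosine similarity. The sketch below illustrates that matching loop; to stay self-contained it substitutes a toy bag-of-words embedding for the fine-tuned paraphrase-MiniLM-L6-v2, and the task list and instruction are invented examples. The underscore normalization mirrors the formatting sensitivity noted above, which the fine-tuned model removes.

```python
import math

# Illustrative task inventory (the real system covers 34 CALVIN tasks).
TASKS = ["open drawer", "close drawer", "lift red block"]

def tokenize(text):
    # Normalize underscores to spaces so "open_drawer" and "open drawer"
    # embed identically (the fine-tuned SBERT handles this implicitly).
    return text.lower().replace("_", " ").split()

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-9)

def match_task(instruction, tasks=TASKS):
    # Stand-in embedding: bag-of-words counts over a shared vocabulary.
    vocab = sorted({w for t in tasks + [instruction] for w in tokenize(t)})
    def embed(text):
        toks = tokenize(text)
        return [toks.count(w) for w in vocab]
    q = embed(instruction)
    # Return the task whose embedding is most similar to the instruction.
    return max(tasks, key=lambda t: cosine(q, embed(t)))

print(match_task("please open_the_drawer"))  # open drawer
```

In the actual pipeline the same argmax-over-cosine-similarity is computed over SBERT sentence embeddings, which is also the quantity the cosine similarity training loss directly optimizes.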

Demo