Token Expand-Merge: Training-Free Compression Accelerates Billion-Parameter Vision-Language-Action Model Inference

Vision-Language-Action models now underpin increasingly sophisticated robotic systems, but their immense size hinders real-time performance in practical applications. Yifan Ye, Jiaqi Ma, and Jun Cen, from Zhejiang University, alongside Zhihe Lu, present a new approach to accelerate these models without sacrificing accuracy. Their work introduces Token Expand-and-Merge-VLA, a technique that intelligently compresses information within the model during operation, rather than requiring costly retraining. By dynamically expanding key details and then merging redundant information, the team achieves a significant speed boost while maintaining, and even improving, the success rate of complex robotic tasks as demonstrated on the LIBERO benchmark. This advancement promises to unlock the potential of large-scale models for responsive and efficient robotic control.
Efficient Token Control for Vision-Language-Action Models

Vision-Language-Action (VLA) models combine computer vision, natural language processing, and robotics so that machines can understand the world and perform tasks. These models demand significant computational resources, hindering deployment in real-world applications, particularly on robots with limited processing power. Recent research therefore focuses on improving efficiency through techniques such as token pruning and merging, which reduce computational load without sacrificing performance. Tokens are the fundamental units of information processed by these models, derived from visual inputs and language instructions. Token pruning identifies and removes unimportant tokens outright, while token merging combines similar tokens into fewer representatives, reducing the overall sequence length and computational cost; a minimal sketch of both operations follows below. Key advancements include training-free methods, which prune or merge tokens without extensive retraining. Incorporating action-awareness, where the robot's intended task guides the compression process, helps preserve task-relevant information. Memory mechanisms allow the model to store and retrieve visual and linguistic information, improving reasoning and action capabilities. The overarching trend in VLA research emphasizes efficiency, with token pruning and merging emerging as promising techniques; action-awareness and memory mechanisms further enhance performance, and methods that avoid extensive retraining are crucial for practical deployment.
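To make the two operations concrete, here is a minimal, self-contained PyTorch sketch (not the paper's code): pruning keeps the highest-scoring tokens outright, while merging averages tokens into groups. The batch shapes, the attention-derived importance score, and the anchor-based grouping (a crude stand-in for bipartite matching) are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def prune_tokens(tokens: torch.Tensor, scores: torch.Tensor, keep: int) -> torch.Tensor:
    """Token pruning: keep only the `keep` highest-scoring tokens per sample.
    tokens: (B, N, D) token embeddings; scores: (B, N) importance scores."""
    idx = scores.topk(keep, dim=1).indices                              # (B, keep)
    return tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))

def merge_tokens(tokens: torch.Tensor, n_groups: int) -> torch.Tensor:
    """Token merging: collapse N tokens into n_groups averages by cosine
    similarity to evenly spaced anchor tokens (an illustrative grouping)."""
    B, N, D = tokens.shape
    anchors = tokens[:, torch.linspace(0, N - 1, n_groups).long()]      # (B, G, D)
    sim = torch.einsum("bnd,bgd->bng",
                       F.normalize(tokens, dim=-1),
                       F.normalize(anchors, dim=-1))
    assign = sim.argmax(dim=-1)                                         # nearest anchor
    merged = torch.zeros(B, n_groups, D)
    counts = torch.zeros(B, n_groups, 1)
    merged.scatter_add_(1, assign.unsqueeze(-1).expand(-1, -1, D), tokens)
    counts.scatter_add_(1, assign.unsqueeze(-1), torch.ones(B, N, 1))
    return merged / counts.clamp(min=1)                                 # group means

vis = torch.randn(2, 256, 64)                # 2 images, 256 patch tokens each
imp = torch.rand(2, 256)                     # e.g. attention-derived importance
print(prune_tokens(vis, imp, 128).shape)     # torch.Size([2, 128, 64])
print(merge_tokens(vis, 64).shape)           # torch.Size([2, 64, 64])
```

The key practical difference: pruning discards information irreversibly, while merging preserves a compressed trace of every token, which is why merging tends to be gentler on fine-grained tasks.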
Token Compression Accelerates Vision-Language-Action Models

Scientists have developed Token Expand-and-Merge-VLA (TEAM-VLA), a framework that accelerates inference in Vision-Language-Action (VLA) models without retraining. Recognizing the computational burden of large VLA models, the team engineered a method to reduce the number of visual tokens processed during inference, improving speed and efficiency in dynamic environments. TEAM-VLA reconstructs dense foreground regions from sparse vision-language cues, addressing the limited object coverage of the initial token responses. This reconstruction uses a smoothing convolutional scan to selectively enlarge linguistically meaningful areas, supplemented by controlled random expansion to preserve potential foreground candidates and maintain structural completeness. To refine the process, the team incorporated a token merging mechanism that leverages action-text interactions to identify and retain task-relevant visual tokens. The researchers observed that intermediate layers reveal additional tokens encoding motion cues and spatial structure, which are crucial for maintaining functional information. TEAM-VLA retains the most action-text-responsive tokens and employs a soft bipartite merging mechanism to compress the remaining tokens into semantically aligned groups, ensuring subtle cues are preserved. Experiments on the LIBERO benchmark demonstrate that TEAM-VLA consistently improves inference speed while matching, and even surpassing, the task success rate of full VLA models, offering a practical acceleration strategy.
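The expand step described above can be illustrated with a short sketch under stated assumptions: smooth a text-to-vision attention map over the patch grid with a small average-pooling "convolutional scan", threshold the result to recover contiguous foreground regions, then randomly re-admit a few background patches to preserve potential foreground candidates. The grid size, threshold, and function names below are hypothetical, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def expand_foreground(attn_map: torch.Tensor, kernel: int = 3,
                      thresh: float = 0.5, n_random: int = 8) -> torch.Tensor:
    """attn_map: (H, W) text-to-vision attention over the patch grid.
    Returns a boolean mask of patches kept as 'foreground'."""
    # Smoothing convolutional scan: average-pool so isolated attention peaks
    # grow into contiguous regions around linguistically salient patches.
    smoothed = F.avg_pool2d(attn_map[None, None], kernel, stride=1,
                            padding=kernel // 2)[0, 0]
    mask = smoothed >= thresh * smoothed.max()

    # Controlled random expansion: re-admit a few background patches so
    # potential foreground candidates are not lost entirely.
    bg = (~mask).nonzero()
    if len(bg) > 0:
        pick = bg[torch.randperm(len(bg))[:n_random]]
        mask[pick[:, 0], pick[:, 1]] = True
    return mask

grid = torch.rand(16, 16)            # toy attention map over a 16x16 patch grid
keep = expand_foreground(grid)
print(keep.sum().item(), "of 256 patches kept")
```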
Dynamic Tokens Accelerate Robotic Perception and Control

Researchers have developed Token Expand-and-Merge-VLA (TEAM-VLA), a framework that accelerates robotic perception and control without retraining existing models. The work addresses the computational demands of large-scale Vision-Language-Action (VLA) models, which hinder real-time deployment in dynamic environments. TEAM-VLA achieves faster inference by intelligently compressing tokens, the fundamental units of information processed by the model, while preserving task performance. The core of TEAM-VLA lies in a dynamic token expansion mechanism that identifies and samples additional informative tokens surrounding areas of attention, enhancing contextual understanding. These expanded tokens are then selectively merged in deeper layers, guided by the actions the robot is performing, effectively reducing redundancy without sacrificing semantic coherence. Experiments on the LIBERO benchmark demonstrate that TEAM-VLA reduces the inference time of OpenVLA-OFT by over 1.5×, achieving a 99.2% success rate at a latency of 68.1 milliseconds, compared with 97.6% and 109 milliseconds for the original OpenVLA-OFT. The method yields a 7.7% higher success rate than EfficientVLA while incurring only an additional 1.5 milliseconds of inference time. Ablation studies confirm the importance of both the token expansion and merging stages.
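A hedged sketch of the action-guided merge stage might look like the following: rank visual tokens by the attention mass that action/text queries place on them, retain the top responders, and softly fold the remainder into their most similar kept tokens. The keep ratio, temperature, and weighted-average update are illustrative choices rather than TEAM-VLA's exact bipartite mechanism.

```python
import torch
import torch.nn.functional as F

def action_guided_merge(vis: torch.Tensor, action_attn: torch.Tensor,
                        keep_ratio: float = 0.5, tau: float = 0.1) -> torch.Tensor:
    """vis: (N, D) visual tokens; action_attn: (N,) attention mass that
    action/text tokens place on each visual token."""
    N, D = vis.shape
    k = max(1, int(N * keep_ratio))
    keep_idx = action_attn.topk(k).indices                  # most action-responsive
    rest_idx = action_attn.topk(N - k, largest=False).indices

    kept, rest = vis[keep_idx], vis[rest_idx]
    # Soft assignment: each merged-away token distributes its features over
    # the kept tokens in proportion to cosine similarity (softmax with temp tau).
    w = F.softmax(F.normalize(rest, dim=-1) @ F.normalize(kept, dim=-1).T / tau,
                  dim=-1)                                   # (N-k, k)
    mass = w.sum(dim=0, keepdim=True).T                     # (k, 1) absorbed weight
    return (kept + w.T @ rest) / (1 + mass)                 # weighted-average merge

tokens = torch.randn(256, 768)
attn = torch.rand(256)
merged = action_guided_merge(tokens, attn)                  # 256 -> 128 tokens
print(merged.shape)                                         # torch.Size([128, 768])
```

Because every dropped token contributes its features to some kept token, subtle motion and spatial cues are diluted rather than deleted, which is consistent with the paper's motivation for soft merging over hard pruning.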
The team also investigated the impact of different merging depths and smoothing kernel sizes, finding optimal performance with a kernel size of 3 and merging at layer 8. Notably, TEAM-VLA operates without relying on historical frame information, offering greater flexibility and adaptability.
Dynamic Token Merging Accelerates Vision-Language-Action Models

Researchers have presented Token Expand-and-Merge-VLA (TEAM-VLA), a framework designed to accelerate inference in large vision-language-action models without retraining. The method addresses the computational demands of these models, which hinder real-time deployment in dynamic environments. TEAM-VLA operates by dynamically expanding informative tokens in areas highlighted by the model's attention mechanisms, enhancing contextual understanding. These expanded tokens are then selectively merged, reducing redundancy while preserving crucial semantic information. Extensive experiments on the LIBERO benchmark demonstrate that TEAM-VLA consistently improves inference speed while matching, and in some cases exceeding, the task success rate achieved by the full, uncompressed model. This represents a significant step towards more efficient and practical robotic systems that rely on complex perception and control.

👉 More information
🗞 Token Expand-Merge: Training-Free Token Compression for Vision-Language-Action Models
🧠 ArXiv: https://arxiv.org/abs/2512.09927
