Multi-camera Encoding with Flex Achieves 2.2x Inference Gains for End-to-End Driving

Autonomous driving systems must process vast amounts of visual data from multiple cameras, creating a significant computational challenge, and researchers are exploring new ways to streamline this process. Jiawei Yang, Ziyu Chen, and Yurong You, along with their colleagues, present Flex, an efficient scene encoder that compresses visual input for improved performance. Flex employs a small set of learnable scene tokens to jointly encode information from multiple camera views and time steps, without relying on the pre-defined 3D structures commonly used in the field. Tested on a large dataset of 20,000 driving hours, Flex achieves a 2.2x increase in inference throughput while simultaneously improving driving performance. The team also finds that the scene tokens develop an emergent ability to decompose complex scenes without explicit supervision, suggesting a more scalable and effective path towards fully autonomous vehicles.
Disentangling Driving Policies with Scene Decomposition

This research introduces Flexibility in Motion (FiM), a new approach to autonomous driving that separates what the car needs to achieve from how it achieves it. This separation allows for greater flexibility and adaptability in diverse driving scenarios, addressing limitations of current systems whose monolithic policies often struggle to generalize. The authors propose that disentangling high-level goals from low-level control is crucial for building robust and flexible autonomous agents. FiM learns to decompose the driving scene into learned tokens representing different aspects of the environment and the car's state, capturing both spatial and temporal information. These tokens form a disentangled representation that separates goal tokens, representing desired outcomes, from control tokens, representing specific actions, and separate policies are learned for predicting each, allowing independent control and adaptation. Experiments demonstrate that FiM improves generalization on unseen driving scenarios, enhances flexibility by adapting to different driving styles, and provides an efficient representation of the driving scene. Built on a transformer-based architecture and trained with self-supervised learning, the approach scales to large, complex environments and achieves competitive or state-of-the-art results on benchmark datasets.

Efficient Multi-View Scene Encoding with Flex

Scientists have developed Flex, a novel scene encoder designed to efficiently process the high volume of data generated by multi-camera systems in autonomous driving. Recognizing the computational demands of existing methods, they engineered a system that compresses visual information into a compact scene representation by initializing a small set of learnable scene tokens. These tokens are combined with all image tokens captured across multiple cameras and time steps and processed by a lightweight Transformer encoder with self-attention layers, enabling joint encoding of information from all views and time points. After encoding, the original image tokens are discarded; only the updated scene tokens are passed to the Large Language Model (LLM) based policy model. This deliberate bottleneck forces the system to compress information across both camera views and time, allowing the model to learn an optimal representation directly from the data without relying on pre-defined 3D structures, and it reduces the token budget passed to the policy model by a factor of 3 to 20. Experiments on a large-scale dataset comprising 20,000 hours of driving footage demonstrate Flex's capabilities: it achieves 2.2x greater inference throughput than state-of-the-art image encoding strategies while simultaneously improving driving performance. This improvement stems from the joint encoding process, in which scene tokens attend to all images across views and time, suppressing redundancy at the scene level.
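To make the token-bottleneck design concrete, here is a minimal PyTorch-style sketch of this kind of scene-token encoder. The dimensions, token counts, and layer settings are illustrative assumptions, not the authors' configuration.

```python
# Minimal sketch of a scene-token bottleneck encoder. All sizes below
# (embedding dim, token counts, layer counts) are assumptions for
# illustration; this is not the authors' implementation.
import torch
import torch.nn as nn

class SceneTokenEncoder(nn.Module):
    def __init__(self, dim=512, num_scene_tokens=64, num_layers=2, num_heads=8):
        super().__init__()
        # Small set of learnable scene tokens shared across views and time steps.
        self.scene_tokens = nn.Parameter(torch.randn(1, num_scene_tokens, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, dim_feedforward=4 * dim, batch_first=True
        )
        # Lightweight Transformer applying full self-attention over all tokens.
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, image_tokens):
        # image_tokens: (B, V*T*P, dim) = patch tokens from all cameras (V),
        # time steps (T), and patches per image (P), already flattened.
        batch = image_tokens.shape[0]
        scene = self.scene_tokens.expand(batch, -1, -1)
        # Jointly encode the scene tokens together with every image token.
        x = torch.cat([scene, image_tokens], dim=1)
        x = self.encoder(x)
        # Discard the image tokens; only the updated scene tokens would be
        # passed on to the policy model, forming the compression bottleneck.
        return x[:, : scene.shape[1]]

# Example: 6 cameras x 4 time steps x 196 patches compressed to 64 scene tokens.
tokens = torch.randn(2, 6 * 4 * 196, 512)
scene_repr = SceneTokenEncoder()(tokens)  # shape: (2, 64, 512)
```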
Flex Achieves Fast, Efficient Scene Encoding for Driving

Researchers present Flex, a novel scene encoder designed to address computational bottlenecks in autonomous driving systems that process data from multiple cameras and time steps. Flex achieves substantial compression of visual input by using a small set of learnable scene tokens to jointly encode information from all image tokens, avoiding reliance on pre-defined three-dimensional assumptions. Experiments on a large-scale dataset of 20,000 driving hours demonstrate that Flex achieves 2.2x greater inference throughput than state-of-the-art image encoding strategies, delivering not only increased processing speed but also improved driving performance. The core of Flex lies in its ability to compress visual data into a compact set of scene tokens, reducing the token budget passed to the policy model by a factor of approximately 3 to 20. This compression is achieved through a lightweight Transformer encoder that jointly processes all tokens with self-attention layers, allowing the model to learn the optimal representation directly from data. Notably, the compressed token representation exhibits an emergent capability for scene decomposition, focusing on critical elements such as the destination, lane markers, and safety-critical areas without specific supervision. Measurements confirm an inference throughput of 0.80 clips per second, an improvement over baseline methods achieving 0.76 clips per second, while simultaneously improving driving performance as measured by minimum Average Displacement Error (minADE).
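The 3x to 20x figure follows directly from the ratio of image tokens to scene tokens. The camera, frame, and patch counts in the short calculation below are assumed purely for illustration; they are not values reported in the paper.

```python
# Back-of-the-envelope token-budget calculation. The camera, frame, and
# patch counts are illustrative assumptions, not figures from the paper;
# they only show how a fixed scene-token set yields a ~3x-20x reduction.
def compression_factor(num_cameras, num_frames, patches_per_image, num_scene_tokens):
    image_tokens = num_cameras * num_frames * patches_per_image
    return image_tokens / num_scene_tokens

# e.g. 6 cameras, 2 frames, 64 patches each, compressed to 256 scene tokens:
print(compression_factor(6, 2, 64, 256))   # 3.0
# e.g. 8 cameras, 4 frames, 160 patches each, compressed to 256 scene tokens:
print(compression_factor(8, 4, 160, 256))  # 20.0
```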
Learned Scene Tokens Enhance Driving Efficiency

Researchers have developed Flex, a novel scene encoder designed to improve the efficiency and performance of autonomous driving systems. Flex addresses a key challenge in processing data from multiple cameras over time by using a small set of learned scene tokens to create a compact representation of the driving environment. Unlike many existing approaches, Flex does not rely on pre-defined three-dimensional assumptions about the scene, instead learning directly from the visual data itself. Evaluations on a large driving dataset demonstrate that Flex achieves a significant increase in processing speed while also improving driving performance compared to current state-of-the-art methods. Notably, the learned scene tokens exhibit an emergent ability to decompose the driving scene, focusing on important features like landmarks while allocating fewer resources to less informative areas. This specialization arises naturally during the learning process, without any explicit programming or supervision, and contributes to the system's efficiency and effectiveness.
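One hypothetical way to probe this kind of specialization is to inspect the attention weights from scene tokens to image patches. The sketch below uses a generic cross-attention module and random tensors as stand-ins; it is an illustrative diagnostic under those assumptions, not a procedure described in the paper.

```python
# Hypothetical probe of scene-token specialization: run one cross-attention
# pass with scene tokens as queries and image patch tokens as keys, then
# check which patches each token weights most heavily. Illustrative only.
import torch
import torch.nn as nn

dim, num_scene_tokens, num_patches = 512, 64, 6 * 4 * 196
attn = nn.MultiheadAttention(embed_dim=dim, num_heads=8, batch_first=True)

scene_tokens = torch.randn(1, num_scene_tokens, dim)  # stand-in for learned tokens
image_tokens = torch.randn(1, num_patches, dim)       # stand-in for encoder features

# attn_weights: (1, num_scene_tokens, num_patches), averaged over heads.
_, attn_weights = attn(scene_tokens, image_tokens, image_tokens,
                       need_weights=True, average_attn_weights=True)

# The top-5 patches per scene token hint at what each token attends to
# (e.g. lane markings vs. uninformative background in a trained model).
top_patches = attn_weights.topk(k=5, dim=-1).indices
print(top_patches.shape)  # torch.Size([1, 64, 5])
```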
The team acknowledges that Flex, like all systems trained on specific datasets, may be limited by biases and coverage gaps in the training data. Future work will focus on broader evaluation across more diverse conditions and careful consideration of the social impact of autonomous driving technology, including safety and fairness. Further research will also explore the interpretability and specialization of the learned scene tokens, potentially revealing deeper insights into how the system understands and represents the driving environment.

👉 More information
🗞 Towards Efficient and Effective Multi-Camera Encoding for End-to-End Driving
🧠 ArXiv: https://arxiv.org/abs/2512.10947
