FoundationMotion: Automated Pipeline Constructs Large-Scale Motion Datasets for Spatial Reasoning and Dynamics Prediction

Understanding movement is central to how we perceive and interact with the physical world, yet current artificial intelligence systems still struggle with accurately interpreting motion in videos. Yulu Gan, Ligeng Zhu, and Dandan Shan, along with colleagues at their respective institutions, address this challenge by introducing FoundationMotion, a novel system that automatically creates large-scale datasets for training these systems.
The team’s approach detects and tracks objects within videos, then uses this information to generate detailed descriptions and questions about the observed motion, bypassing the need for expensive manual annotation. By using these automatically generated datasets to fine-tune existing open-source models, the researchers achieve significant improvements in motion understanding, surpassing leading closed-source systems and other open-source alternatives across a range of benchmarks and offering a scalable path to more intelligent and perceptive artificial intelligence.

Video QA Pairs From Motion and Space

This research details a system for generating and evaluating question-answering (QA) pairs for videos, focusing on motion and spatial relationships. The goal is to create a dataset for training and testing video understanding models that reason about how objects move and where they are positioned. The system draws on the video content, descriptive captions, and per-frame data describing the location and type of each object.

The pipeline relies on detailed prompts for caption generation, QA creation, and evaluation. These prompts steer the system toward questions covering motion-related concepts, including action recognition, temporal order, object-action associations, and spatial context. Single-answer QA pairs are converted into multiple-choice questions with distinctive, unambiguous choices, and evaluation metrics assess fine-grained action accuracy, motion detail, temporal coherence, and question relevance.

In operation, the system receives a video and generates a detailed caption emphasizing motion and spatial relationships. Object detection then identifies and tracks objects within the video, providing bounding-box data. This data, combined with the video and caption, drives the generation of QA pairs, which are converted into multiple-choice questions and scored with the defined metrics. The resulting dataset can be used to train models for applications such as video understanding, robotics, and autonomous systems, improving their ability to interpret and respond to dynamic scenes.
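In code, this per-video flow could be organized as a single routine. The following is a minimal Python sketch under assumed interfaces, not the authors' released implementation: the detector, captioner, QA generator, multiple-choice converter, and quality scorer are hypothetical callables supplied from outside.

```python
# Hypothetical outline of the per-video QA-generation flow; the component callables
# (detector, captioner, qa_generator, mcq_converter, scorer) are assumed interfaces,
# not the authors' actual API.

def build_video_examples(video_path, detector, captioner, qa_generator,
                         mcq_converter, scorer):
    """Run one video through the caption -> tracks -> QA -> multiple-choice flow."""
    tracks = detector(video_path)                       # per-object bounding-box tracks
    caption = captioner(video_path, tracks)             # motion- and space-aware caption
    qa_pairs = qa_generator(video_path, caption, tracks)
    mc_questions = [mcq_converter(qa) for qa in qa_pairs]
    # Keep only questions that pass the quality checks (action accuracy,
    # motion detail, temporal coherence, relevance).
    return {"video": video_path,
            "caption": caption,
            "qa": [q for q in mc_questions if scorer(q)]}
```

Each callable corresponds to one stage named above, so swapping in a different detector or language model leaves the orchestration unchanged.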
Automated Motion Dataset Construction Using Large Language Models

Researchers have pioneered FoundationMotion, a fully automated pipeline for constructing large-scale motion datasets, addressing the scarcity of detailed motion data that limits current artificial intelligence systems.
The team engineered a system that detects and tracks objects within videos, utilizing advanced recognition and segmentation models to identify elements like vehicles, hands, and bodies. This tracking process generates structured data detailing object movement, forming the foundation for detailed motion analysis.
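As a toy illustration of what such structured tracking output might look like, the Python snippet below groups per-frame bounding boxes into per-object tracks and derives a coarse movement direction. The record schema and the displacement heuristic are assumptions made for illustration, not the pipeline's actual format.

```python
from collections import defaultdict

# Toy detections: (frame_index, track_id, label, bounding box as x1, y1, x2, y2).
# The schema is illustrative, not the pipeline's actual output format.
detections = [
    (0, 1, "car",  (100, 200, 180, 260)),
    (1, 1, "car",  (130, 200, 210, 260)),
    (2, 1, "car",  (160, 200, 240, 260)),
    (0, 2, "hand", (400, 300, 440, 350)),
    (2, 2, "hand", (390, 330, 430, 380)),
]

# Group detections into per-object tracks.
tracks = defaultdict(list)
for frame, track_id, label, box in detections:
    tracks[(track_id, label)].append((frame, box))

def center(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

# Summarize each track as a coarse motion statement.
for (track_id, label), obs in sorted(tracks.items()):
    obs.sort()                                     # order observations by frame index
    (f0, b0), (f1, b1) = obs[0], obs[-1]
    (x0, y0), (xn, yn) = center(b0), center(b1)
    dx, dy = xn - x0, yn - y0                      # pixel displacement over the track
    if abs(dx) >= abs(dy):
        direction = "right" if dx > 0 else "left"
    else:
        direction = "down" if dy > 0 else "up"     # image y grows downward
    print(f"{label} #{track_id}: frames {f0}-{f1}, moved {direction} by ({dx:.0f}, {dy:.0f}) px")
```

Summaries of this kind (object, frame range, direction, displacement) are the sort of structured signal a language model can turn into grounded captions and questions.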
The team then used large language models (LLMs) to automatically generate descriptive summaries and question-answer pairs from the tracked object motions, supporting both motion understanding and interactive question answering. This process yielded a dataset of approximately 500,000 question-answer pairs and descriptive captions, collectively known as the FoundationMotion Dataset 0.2. To address the lack of benchmarks evaluating how motion occurs, the researchers also manually collected videos depicting various activities and annotated them with detailed question-answer pairs. Experiments involved fine-tuning several open-source vision-language models (VLMs) on the newly created FoundationMotion dataset and rigorously evaluating their performance on both publicly available benchmarks and the manually annotated “how” motion benchmark.
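Before fine-tuning, each generated QA pair has to be cast into the multiple-choice format used for training and evaluation. The sketch below shows, under assumed prompt wording and an assumed JSON schema, how track summaries might be folded into an LLM prompt and how a single-answer pair could be converted into a multiple-choice question; none of the prompt text or helper names come from the paper's released prompts.

```python
import json
import random

# Illustrative track summaries; the schema is an assumption, not the paper's format.
track_summary = [
    {"object": "car #1",  "motion": "moves right across frames 0-2"},
    {"object": "hand #2", "motion": "moves down between frames 0 and 2"},
]

# A hypothetical QA-generation prompt grounded in the caption and tracks.
qa_prompt = (
    "You are given a video caption and structured object tracks.\n"
    "Write question-answer pairs about how the objects move: action recognition, "
    "temporal order, object-action association, and spatial context.\n\n"
    f"Tracks:\n{json.dumps(track_summary, indent=2)}\n"
    "Caption: A car drives to the right while a hand reaches downward.\n"
    'Return JSON: [{"question": "...", "answer": "..."}]'
)
# qa_pairs = call_llm(qa_prompt)   # placeholder for an actual LLM call

def to_multiple_choice(question, answer, distractors, seed=0):
    """Shuffle the correct answer among distractors and record the correct letter."""
    options = distractors + [answer]
    random.Random(seed).shuffle(options)
    letters = "ABCD"[: len(options)]
    return {
        "question": question,
        "choices": dict(zip(letters, options)),
        "answer": letters[options.index(answer)],
    }

print(to_multiple_choice(
    "In which direction does the car move?",
    "To the right",
    ["To the left", "Upward", "It stays still"],
))
```

A quality-filtering step of the kind described above would then discard questions whose distractors are ambiguous or too close to the correct answer.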
Results demonstrate that models trained with FoundationMotion significantly outperform larger open-source models and even closed-source systems like Gemini-2.5-Flash, establishing a new standard for motion understanding and spatial reasoning.
The team intends to release all code, data, and benchmarks to foster community development and further advancements in the field.
Automated Motion Dataset Boosts Motion Understanding Performance

Scientists have developed FoundationMotion, a fully automated pipeline for constructing large-scale motion datasets, addressing a critical limitation in the field of physical reasoning and motion understanding. The work delivers approximately 500,000 question-answer pairs and captions, collectively known as the FoundationMotion Dataset 0.2, generated through automated video analysis and language modeling. The pipeline leverages state-of-the-art recognition and segmentation models, alongside large language models, to detect, track, and annotate object motion across diverse video content.

Experiments reveal that fine-tuning open-source video language models with this dataset significantly improves performance on both public benchmarks and a newly created “how” motion benchmark, which focuses on understanding how motions happen rather than just what motions occur. Models trained with FoundationMotion outperform larger open-source models and even surpass closed-source models like Gemini-2.5-Flash, demonstrating superior performance across diverse motion understanding datasets and benchmarks and establishing a new standard for evaluating motion reasoning capabilities.

The research addresses a significant gap in existing datasets, which often lack the fine-grained detail needed for accurate motion analysis. To build the new benchmark, the team manually collected videos and annotated question-answer pairs across domains including hand motion, driving scenarios, robot manipulation, and autonomous vehicle movement.
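Scoring such a multiple-choice benchmark typically comes down to comparing the option letter a model picks with the annotated answer. The minimal sketch below uses made-up items and predictions purely to show that bookkeeping; it is not data from, or the evaluation code of, the actual benchmark.

```python
# Made-up multiple-choice items and model replies, used only to illustrate scoring.
benchmark = [
    {"id": "hand_001",  "answer": "B"},
    {"id": "drive_014", "answer": "A"},
    {"id": "robot_007", "answer": "D"},
]
predictions = {"hand_001": "B", "drive_014": "C", "robot_007": "D"}

def choice_letter(reply):
    """Normalize a model reply to a single option letter, if possible."""
    reply = reply.strip().upper()
    return reply[0] if reply and reply[0] in "ABCD" else None

correct = sum(
    choice_letter(predictions.get(item["id"], "")) == item["answer"]
    for item in benchmark
)
print(f"accuracy: {correct}/{len(benchmark)} = {correct / len(benchmark):.2%}")
```

Accuracies computed this way, per benchmark and per domain, are what back the comparisons against the larger open-source and closed-source baselines.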
Results demonstrate that the automated pipeline successfully generates detailed motion summaries and question-answer pairs, enabling both motion understanding and question-answering over dynamic scenes.
Automated Motion Data Curation and Benchmarking

FoundationMotion represents a significant advance in motion understanding through the development of a fully automated data curation pipeline. Researchers constructed large-scale motion datasets, overcoming the scarcity of the fine-grained, annotated data typically required for training advanced models. The pipeline combines object detection and tracking within videos with large language models to generate detailed captions and question-answer pairs focused on motion and spatial reasoning. The resulting datasets were used to fine-tune open-source visual language models, achieving substantial performance improvements on motion understanding benchmarks. Notably, these fine-tuned models outperformed larger, closed-source alternatives, demonstrating the effectiveness of the automated data curation approach. This work provides a scalable solution for creating the high-quality datasets needed to enhance motion understanding and spatial reasoning in artificial intelligence systems.

The authors acknowledge that the current system operates primarily within a 2D spatial framework, limiting its ability to fully capture the complexities of motion in three-dimensional space. Future research will focus on extending the system to 3D motion understanding, particularly for applications such as robotics where precise spatial awareness is crucial. All code, data, and benchmarks will be released to encourage further development in this rapidly evolving field.

More information: FoundationMotion: Auto-Labeling and Reasoning about Spatial Movement in Videos, arXiv: https://arxiv.org/abs/2512.10927
