TimeLens Enables Accurate Video Understanding by Addressing Data Quality in Temporal Grounding Benchmarks

Video understanding receives a significant boost from new research into how machines pinpoint moments within video footage, a process known as temporal grounding. Jun Zhang from Nanjing University, Teng Wang and Ying Shan from ARC Lab, Tencent PCG, along with their colleagues, present TimeLens, a comprehensive investigation into optimising large multimodal models for this challenging task.
The team identifies critical flaws in existing video benchmarks and introduces TimeLens-Bench, a rigorously re-annotated resource that reveals the unreliability of previous evaluation methods. This work culminates in TimeLens, a family of models that achieve state-of-the-art performance, surpassing even leading proprietary systems, and the researchers make all code and data publicly available to accelerate progress in the field.
Multimodal Large Language Models (MLLMs) excel at various video understanding tasks, but optimising them for Video Temporal Grounding (VTG), the task of accurately identifying moments in videos based on language descriptions, remains a challenge. This work presents TimeLens, a systematic investigation into building MLLMs with strong VTG ability, focusing on both data quality and algorithmic design.
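To make the task concrete, a VTG example pairs a video and a language query with a target time span, and a model must return start and end times in seconds. The minimal sketch below illustrates this interface with hypothetical field names; it is not drawn from the TimeLens code.

```python
# Minimal sketch of the Video Temporal Grounding (VTG) task interface.
# Field names are illustrative, not taken from the TimeLens codebase.
from dataclasses import dataclass


@dataclass
class VTGSample:
    video_path: str   # path to the source video
    query: str        # natural-language event description
    start_sec: float  # ground-truth start of the described moment, in seconds
    end_sec: float    # ground-truth end of the described moment, in seconds


# A grounding model maps (video, query) -> a predicted (start, end) span.
sample = VTGSample(
    video_path="kitchen.mp4",
    query="the person pours coffee into a cup",
    start_sec=12.4,
    end_sec=18.9,
)
print(sample)
```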
The team first exposed critical quality issues in existing VTG benchmarks, revealing unreliable evaluation standards and misleading performance metrics. They then introduced TimeLens-Bench, comprising meticulously re-annotated versions of three popular benchmarks, Charades-STA, ActivityNet Captions, and QVHighlights, built to strict quality criteria. Analysis reveals dramatic model re-rankings compared to the legacy benchmarks, confirming the unreliability of prior evaluation standards. The researchers also addressed noisy training data through an automated re-annotation pipeline.

Event Localization via Video Timestamp Annotation

This research focuses on video temporal localisation, the task of identifying the start and end times of specific events within a video given a textual description. The re-annotation process follows detailed guidelines so that annotators accurately pinpoint event occurrences and timings. Annotators work through multiple steps: verifying that the event is present, rewriting ambiguous descriptions to reflect the actual video content, polishing descriptions for clarity, and precisely timestamping event boundaries. Quality-control measures, such as reviewing the entire video after annotation, ensure consistency and accuracy.

TimeLens-Bench Reveals Flawed Video Evaluation Standards

This work establishes a new baseline for video temporal grounding and delivers significant improvements in both data quality and algorithmic design. Through meticulous manual re-annotation of Charades-STA, ActivityNet Captions, and QVHighlights, the team created TimeLens-Bench, a rigorously cross-validated benchmark that dramatically re-ranks models relative to the legacy benchmarks, demonstrating the unreliability of prior evaluation standards and the necessity of high-quality data for accurate assessment. Beyond evaluation, the team addressed noisy training data by creating TimeLens-100K, a large-scale, high-quality training dataset.

Detailed analysis of algorithmic design principles revealed that interleaved textual encoding outperforms more complex strategies for representing timestamps (a hypothetical rendering of such a prompt is sketched below). The researchers further determined that VTG is fundamentally a perception-driven task, which motivated a thinking-free reinforcement learning approach with verifiable rewards (RLVR) that surpasses other training paradigms in both efficiency and performance. Careful experimentation identified two key recipes for optimising RLVR training: early stopping when reward metrics plateau, and difficulty-based data sampling (see the second sketch below). Integrating these insights culminated in the TimeLens models, a family of MLLMs that achieve state-of-the-art VTG performance, surpassing even proprietary models such as GPT-5 and Gemini-2.5-Flash. Qualitative analysis revealed specific errors in existing datasets, including inaccurate annotations, multiple event occurrences, unclear queries, and duplicate entries, all of which were corrected in the refined TimeLens-Bench and TimeLens-100K datasets. These advancements establish a solid foundation for future research in building MLLMs with strong VTG capabilities.
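To picture the interleaved textual encoding mentioned above, the sketch below shows one plausible way to weave timestamps, rendered as ordinary text, between per-frame visual tokens before appending the grounding question. The token format and placeholder names are assumptions for illustration, not the authors' exact prompt.

```python
# Hypothetical illustration of interleaved textual timestamp encoding: each
# sampled frame is preceded by its timestamp written as plain text, so the
# model reads time directly from the token stream. The exact tokens and
# formatting are assumptions, not the TimeLens prompt format.
def build_interleaved_prompt(frame_times_sec, query):
    parts = []
    for t in frame_times_sec:
        parts.append(f"<{t:.1f}s>")  # timestamp rendered as ordinary text
        parts.append("<frame>")      # placeholder for that frame's visual tokens
    parts.append(f"Query: {query}")
    parts.append("Answer with the start and end time of the event in seconds.")
    return "\n".join(parts)


print(build_interleaved_prompt([0.0, 2.0, 4.0, 6.0], "the person opens the fridge"))
```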
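The thinking-free RLVR recipe relies on a reward that can be checked directly against the annotation, and temporal IoU between the predicted and ground-truth spans is a natural candidate. The sketch below pairs such a reward with simple stand-ins for the two training recipes, early stopping on a reward plateau and difficulty-based sampling; the exact reward shaping, plateau criterion, and sampling rule used in TimeLens are not specified in this article, so everything here is an assumption.

```python
# Hedged sketch of thinking-free RLVR ingredients for VTG: a verifiable reward
# (temporal IoU against the annotated span), a reward-plateau early-stopping
# check, and difficulty-weighted data sampling. All three are illustrative
# stand-ins, not the TimeLens implementation.
import random


def temporal_iou(pred, gt):
    """IoU of two (start, end) intervals given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0


def verifiable_reward(pred_span, gt_span):
    # "Verifiable" because it is computed directly from the prediction and the
    # annotation, with no learned judge involved.
    return temporal_iou(pred_span, gt_span)


def reward_plateaued(reward_history, window=50, tol=1e-3):
    """Early stopping: stop once the mean reward of the latest window no longer
    improves on the previous window by more than `tol`."""
    if len(reward_history) < 2 * window:
        return False
    recent = sum(reward_history[-window:]) / window
    previous = sum(reward_history[-2 * window:-window]) / window
    return recent - previous < tol


def sample_by_difficulty(pool, batch_size, rng=random):
    """Difficulty-based sampling: `pool` holds (sample, difficulty) pairs with
    difficulty in [0, 1]; weights peak at mid difficulty so batches avoid items
    that are already solved or currently hopeless."""
    weights = [d * (1.0 - d) + 1e-3 for _, d in pool]
    return rng.choices([s for s, _ in pool], weights=weights, k=batch_size)


print(verifiable_reward((11.8, 19.5), (12.4, 18.9)))  # high overlap -> ~0.84
```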
TimeLens Improves Video Temporal Grounding Accuracy

This research presents TimeLens, a systematic investigation into building multimodal large language models (MLLMs) with improved video temporal grounding capabilities, the ability to accurately pinpoint moments in videos based on natural language descriptions.
The team addressed critical issues with existing benchmarks, revealing substantial quality problems that compromised reliable evaluation, and introduced TimeLens-Bench, a meticulously re-annotated suite for trustworthy assessment of model performance. Alongside this, they developed TimeLens-100K, a large-scale, high-quality training dataset created through an automated re-annotation pipeline to mitigate the impact of noisy data. Through comprehensive algorithmic exploration, the researchers identified key principles for effective MLLM training, culminating in the TimeLens family of models. These models achieve state-of-the-art performance in video temporal grounding, surpassing both open-source alternatives and leading proprietary models such as GPT-5 and Gemini-2.5-Flash. The authors acknowledge that existing benchmarks, even with re-annotation, may still contain inherent biases or limitations, and future work could focus on developing even more robust and comprehensive evaluation metrics. By releasing their code, data, and models, the team intends to provide a strong foundation for further advancements in building MLLMs with enhanced temporal video understanding.

👉 More information
🗞 TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs
🧠 ArXiv: https://arxiv.org/abs/2512.14698
