Skyra Enables AI Video Detection with Grounded Reasoning and a New 4K ViF-CoT Dataset

The increasing prevalence of AI-generated videos makes it ever harder to distinguish authentic content from synthetic media, demanding robust detection methods. Yifei Li, Wenzhao Zheng, and Yanran Zhang of Tsinghua University, together with their colleagues, address this need with Skyra, a system that identifies telltale visual inconsistencies, or artifacts, within AI-generated videos. Unlike existing approaches that merely classify videos as real or fake, Skyra reasons about these grounded artifacts, providing both accurate detection and understandable explanations for its decisions.
The team developed ViF-CoT-4K, a large, meticulously annotated dataset, together with a novel training strategy that sharpens Skyra’s ability to perceive subtle spatio-temporal inconsistencies. The resulting system outperforms existing methods and offers valuable insights for the development of explainable AI in media authentication.
Video Authenticity Analysis Via Visual Inconsistencies

Skyra analyzes videos and determines their authenticity by identifying specific visual inconsistencies and artifacts. It categorizes these inconsistencies, such as shape distortion or camera-motion inconsistencies, and classifies videos as either fake or real, functioning as an analytical tool that grounds its verdict in observed characteristics. The system accepts video content, described frame by frame or as a summary, and returns a determination of authenticity along with the type of artifact detected, if any. The analysis follows a clear format: a description of observed video characteristics, identification of any detected artifact types, and a final determination of whether the video is fake or real.
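To make this three-part format concrete, the sketch below shows one way such an analysis could be parsed into a structured record. It is a minimal illustration only: the tag names, field names, and example artifact label are assumptions, since the article describes the format’s three parts but not its exact markup.

```python
import re
from dataclasses import dataclass, field

# Minimal sketch of parsing a Skyra-style analysis into a structured record.
# The tags (<description>, <artifacts>, <verdict>) are hypothetical; the
# article specifies the three-part format but not how it is delimited.

@dataclass
class VideoAnalysis:
    description: str                                     # observed video characteristics
    artifacts: list[str] = field(default_factory=list)   # e.g. "shape distortion"
    verdict: str = "real"                                 # "fake" or "real"

def parse_analysis(response: str) -> VideoAnalysis:
    """Extract the three sections the article describes from a model response."""
    def section(tag: str) -> str:
        m = re.search(rf"<{tag}>(.*?)</{tag}>", response, re.DOTALL)
        return m.group(1).strip() if m else ""

    artifacts = [a.strip() for a in section("artifacts").split(",") if a.strip()]
    verdict = "fake" if "fake" in section("verdict").lower() else "real"
    return VideoAnalysis(section("description"), artifacts, verdict)

# Example: a response flagging a camera-motion inconsistency.
demo = ("<description>Background parallax does not match the camera pan.</description>"
        "<artifacts>camera motion inconsistency</artifacts>"
        "<verdict>fake</verdict>")
print(parse_analysis(demo))
```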
Skyra Model and ViF-CoT-4K Dataset Development

The research team built Skyra as a specialized multimodal large language model (MLLM) that both reliably detects AI-generated videos and, crucially, explains its reasoning. To train it, they constructed ViF-CoT-4K, a large-scale dataset designed for supervised fine-tuning and the first resource of its kind to provide detailed human annotations of AI-generated video artifacts; its high-quality samples form the foundation for teaching a model to identify the inconsistencies often present in synthetic videos. The team then implemented a two-stage training strategy that systematically enhances the model’s ability to perceive spatio-temporal artifacts, articulate its reasoning, and ultimately improve detection accuracy. To evaluate Skyra rigorously, the researchers introduced ViF-Bench, a benchmark of 3,000 high-quality video samples produced by more than ten state-of-the-art video generators, ensuring a comprehensive assessment across diverse synthetic content.

Skyra Detects and Explains AI-Generated Videos

The two-stage strategy begins with supervised fine-tuning on ViF-CoT-4K, which endows the model with essential detection and explanation abilities and prepares it for subsequent reinforcement learning; this stage performs full-parameter fine-tuning of Qwen2.5-VL-7B with a learning rate of 1e-5 for 5 epochs. The second stage refines the model with reinforcement learning via the Group Relative Policy Optimization (GRPO) algorithm. The reward function encourages the model to actively explore potential forgery cues while strictly adhering to a prescribed output format. Accuracy rewards are designed asymmetrically, with more severe penalties for false positives, reflecting the inherent difficulty of comprehensively verifying the authenticity of a real video; this design prevents overfitting and encourages the model to treat even a single identified artifact as evidence of manipulation.
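The sketch below illustrates how such a reward could look in code, assuming hypothetical penalty magnitudes and a simple tag-based format check (the article specifies the asymmetry and the strict-format requirement, but not exact values); the group-relative advantage step is standard GRPO normalization, not something confirmed to be Skyra-specific.

```python
import statistics

# Sketch of the asymmetric accuracy reward described above, for a GRPO-style
# loop. Penalty magnitudes and the format check are assumptions; the article
# states only that false positives (real videos flagged as fake) incur the
# harsher penalty and that the output format must be strictly followed.

REQUIRED_TAGS = ("<description>", "<artifacts>", "<verdict>")  # hypothetical markup

def format_ok(response: str) -> bool:
    """Strict format gate: every required section must appear in the response."""
    return all(tag in response for tag in REQUIRED_TAGS)

def accuracy_reward(predicted: str, label: str) -> float:
    """Asymmetric accuracy term: flagging a real video as fake costs more than
    missing a fake one, since a real video's authenticity is hard to verify
    exhaustively, while a single genuine artifact suffices to establish fakery."""
    if predicted == label:
        return 1.0
    if predicted == "fake" and label == "real":
        return -2.0  # false positive: severe penalty (magnitude is an assumption)
    return -1.0      # false negative: milder penalty (magnitude is an assumption)

def total_reward(response: str, predicted: str, label: str) -> float:
    """Combine the strict format gate with the asymmetric accuracy term."""
    if not format_ok(response):
        return -1.0  # malformed output earns no accuracy credit
    return accuracy_reward(predicted, label)

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO's group baseline: standardize rewards across a group of sampled
    responses to the same prompt, so no learned value model is needed."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]
```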
Extensive testing on ViF-Bench demonstrates that Skyra surpasses existing methods in both detection accuracy and explainability. The researchers observed that encouraging active cue exploration and strict adherence to the prescribed output format significantly improved performance, underscoring the importance of coupling accuracy with explainability in this critical field.

Pinpointing AI Video Artifacts with Skyra

Skyra represents a significant advance in the detection of AI-generated videos, moving beyond binary classification to pinpointing and explaining the visual inconsistencies that reveal a video’s artificial origin. Because the model identifies human-perceivable artifacts, it provides grounded evidence in the form of localized visual anomalies, a level of transparency absent from existing classify-only methods. The authors acknowledge that current methods still struggle with extremely high-quality AI-generated videos whose artifacts are minimal or imperceptible. Future research directions include improving robustness against increasingly sophisticated generative models and expanding the dataset to cover a wider range of video types and artifact characteristics. Despite these limitations, Skyra establishes a promising new direction for explainable AI-generated video detection, offering a crucial tool for combating the spread of misinformation and maintaining trust in visual media.

👉 More information
🗞 Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning
🧠 ArXiv: https://arxiv.org/abs/2512.15693
