
A new artificial intelligence system, LongVT, has demonstrated significant progress in processing and understanding long-form videos, achieving a score of 42.0 on the challenging VideoSIAH-Eval benchmark. Developed by researchers from EvolvingLMMs-Lab and MiroMind AI, LongVT introduces an approach that mimics human cognitive strategies for video comprehension, enabling models to "think" over extended video content by selectively focusing on crucial segments. The findings are detailed in a recent paper titled "LongVT: Incentivizing 'Thinking with Long Videos' via Native Tool Calling," published on arXiv.org.
Long-form videos pose a substantial challenge for large multimodal models (LMMs) due to the sheer volume of data and the sparse, often fleeting nature of critical information. As Rohan Paul put it in a tweet, "Long videos are hard because the key moment might last only a second while the file itself can run for more than 1 hour." Traditional LMMs struggle with hallucinations and temporal misalignment when dealing with such extensive content. LongVT addresses this by training models to use a crop_video tool, allowing them to skim forward, identify potential key moments, and then "rewind a short part to watch carefully," much like a human viewer would.
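To make the mechanics concrete, here is a minimal sketch of what such a crop_video tool could look like. The paper names the tool, but the function signature, the ffmpeg invocation, and the output path scheme below are illustrative assumptions, not LongVT's actual implementation:

```python
# Hypothetical sketch of a crop_video tool: given a long video and a time
# window, return a short clip the model can "rewind and watch carefully".
# The signature and ffmpeg usage are assumptions for illustration only.
import subprocess
from pathlib import Path

def crop_video(video_path: str, start_sec: float, end_sec: float,
               out_dir: str = "clips") -> str:
    """Extract the [start_sec, end_sec] window of a video into a short clip."""
    out_dir_path = Path(out_dir)
    out_dir_path.mkdir(exist_ok=True)
    out_path = str(out_dir_path / f"clip_{start_sec:.1f}_{end_sec:.1f}.mp4")
    # Input-side -ss seeks to the window start; -t takes its duration.
    # Stream copy (-c copy) skips re-encoding, so cuts snap to keyframes,
    # which is accurate enough for a skim-then-rewind workflow.
    subprocess.run(
        ["ffmpeg", "-y", "-ss", str(start_sec), "-i", video_path,
         "-t", str(end_sec - start_sec), "-c", "copy", out_path],
        check=True, capture_output=True,
    )
    return out_path
```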
The core of LongVT's methodology is the "interleaved Multimodal Chain of Tool Thought" (iMCoTT), a loop that enables the model to propose time windows, fetch corresponding short clips, and re-evaluate based on new evidence. This iterative process allows the model to refine its guesses and pinpoint relevant events with greater accuracy. To facilitate this, the researchers created the VideoSIAH dataset, specifically designed with questions whose evidence resides in "tiny slices of hours long videos," complete with example traces demonstrating how a model navigates the timeline to find answers.
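The loop itself can be summarized in a few lines of Python. This is a hedged sketch of the iMCoTT control flow as described above; the helper names (lmm_generate, parse_tool_call) and the message format are hypothetical stand-ins, and crop_video refers to the illustrative tool sketched earlier:

```python
# Hedged sketch of the iMCoTT loop: propose a time window, fetch the clip,
# re-evaluate, and repeat until the model commits to an answer.
# lmm_generate and parse_tool_call are hypothetical stand-ins.
MAX_TURNS = 8

def imcott_answer(question, video_path, lmm_generate, parse_tool_call):
    # Begin with a coarse, uniformly sampled view of the whole video.
    context = [{"role": "user", "content": question,
                "video": video_path, "sampling": "coarse"}]
    for _ in range(MAX_TURNS):
        response = lmm_generate(context)   # reasoning text, maybe a tool call
        call = parse_tool_call(response)   # e.g. {"start": 3610.0, "end": 3625.0}
        if call is None:                   # no tool call: final answer reached
            return response
        # Fetch the proposed window as a short, densely sampled clip and feed
        # it back, letting the model confirm or revise its hypothesis.
        clip = crop_video(video_path, call["start"], call["end"])
        context.append({"role": "assistant", "content": response})
        context.append({"role": "tool", "video": clip, "sampling": "dense"})
    # Turn budget exhausted: force an answer from the evidence gathered so far.
    return lmm_generate(context + [{"role": "user",
                                    "content": "Answer with the best evidence seen."}])
```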
LongVT's training follows a three-stage process: initial supervised fine-tuning (SFT) to teach basic tool usage, followed by agentic reinforcement learning (RL) to improve generalization and decision-making, and finally reinforcement fine-tuning (RFT) to stabilize the learned behaviors using the model's own best runs. This strategy has enabled LongVT to surpass other open-source video models, significantly reducing "hallucinations about where and when events happen." The team has open-sourced all code, data, and model checkpoints, fostering further research and development in the field.
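Read as a pipeline, the three stages might be wired together roughly as follows. This sketch only mirrors the recipe described above; the trainer entry points (run_sft, run_agentic_rl, collect_rollouts, reward) are hypothetical placeholders for whatever framework the team actually used:

```python
# Hypothetical outline of LongVT's three-stage recipe; all trainer
# functions below are illustrative placeholders, not a real API.
def train_longvt(base_model, tool_use_traces, rl_tasks):
    # Stage 1: SFT on curated tool-calling traces teaches the model the
    # crop_video call format and when a rewind is worth issuing.
    model = run_sft(base_model, dataset=tool_use_traces)

    # Stage 2: agentic RL lets the model interact with full-length videos,
    # rewarding correct final answers to sharpen generalization and
    # tool-use decisions.
    model = run_agentic_rl(model, tasks=rl_tasks)

    # Stage 3: RFT stabilizes behavior by fine-tuning on the model's own
    # best rollouts (rejection sampling: keep only fully correct runs).
    rollouts = collect_rollouts(model, tasks=rl_tasks, samples_per_task=8)
    best_runs = [r for r in rollouts if reward(r) == 1.0]
    model = run_sft(model, dataset=best_runs)
    return model
```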