Extracting useful information from long videos, whether meeting recordings, experimental data, or lecture content, requires painstaking manual review. AI tools offer some help: vision-language models can summarize short clips or answer questions when videos are divided into clear scenes or chapters. But for hours‑long recordings packed with information and lacking obvious structure, current models are limited. They process videos slowly, cannot connect information across long stretches of content, and often provide limited or unhelpful answers.
To address these limitations, researchers at Microsoft Research Asia developed Deep Video Discovery (DVD), an agentic AI framework for long-video analysis. DVD divides long videos into shorter clips for individual analysis, then uses LLM-based reasoning to plan next steps and select appropriate tools. The agent retrieves needed information and uses it to answer complex questions about the video.
How DVD works
DVD operates through a simple cycle: observe the video content, analyze what it means, and choose the next action. Current video-analysis systems follow rigid, predesigned steps that have difficulty adapting to different tasks. In contrast, DVD adjusts its approach based on information it has gathered so far. To support this flexibility, the system operates in two stages:
Stage 1: Building a searchable video database
The system converts long videos into a structured database, dividing them into five-second clips and extracting information at three levels:
- Global: Provides a topic-level summary of the video.
- Clip‑level: Includes subtitles and brief text descriptions of each segment.
- Frame‑level: Includes individual frames and visual details captured moment by moment.
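The three-level structure above can be sketched in code. This is a minimal illustration, not DVD's actual implementation: the `Clip` and `VideoDatabase` classes and the callables passed to `build_database` (standing in for ASR, captioning, and summarization models) are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field

CLIP_SECONDS = 5  # clip length used by DVD, per the article


@dataclass
class Clip:
    """One five-second segment with clip-level and frame-level data."""
    start: float          # start time in seconds
    subtitle: str         # subtitles falling within this window
    description: str      # brief text description of the segment
    frames: list = field(default_factory=list)  # sampled frames (frame-level)


@dataclass
class VideoDatabase:
    """Three-level index: a global summary plus per-clip entries."""
    global_summary: str   # global level: topic-level summary
    clips: list           # clip level (with frame-level data inside)


def build_database(duration, subtitles, describe, sample_frames, summarize):
    """Divide a video into five-second clips and extract all three levels.

    `subtitles`, `describe`, `sample_frames`, and `summarize` are
    placeholder callables standing in for the underlying models.
    """
    clips = []
    t = 0.0
    while t < duration:
        clips.append(Clip(
            start=t,
            subtitle=subtitles(t, t + CLIP_SECONDS),
            description=describe(t, t + CLIP_SECONDS),
            frames=sample_frames(t, t + CLIP_SECONDS),
        ))
        t += CLIP_SECONDS
    return VideoDatabase(global_summary=summarize(clips), clips=clips)
```

The point of precomputing all three levels is that the agent can later answer coarse questions from the summary alone and drill down to frames only when needed.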
Stage 2: Retrieving information and generating answers
The system uses three core tools to search the database:
- Global browse: Provides high‑level context and video summaries.
- Clip search: Retrieves clips that match a description and returns relevant results with subtitles and timestamps.
- Frame inspect: Examines a specific moment in the video and extracts fine visual details; it can also answer questions about what appears in that frame.
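A rough sketch of what these three tools might look like over a simple database, assuming the database is a dictionary with a `global_summary` string and a list of `clips` (each with `start`, `subtitle`, `description`, and `frames`). The function names mirror the tools above, but the matching logic is a stand-in: a real system would use embedding similarity for clip search and a vision model for frame inspection.

```python
def global_browse(db):
    """Return high-level context: the summary of the whole video."""
    return db["global_summary"]


def clip_search(db, query):
    """Return clips matching a description, with subtitles and timestamps.

    Plain substring matching stands in for semantic retrieval here.
    """
    q = query.lower()
    return [
        {"timestamp": c["start"], "subtitle": c["subtitle"]}
        for c in db["clips"]
        if q in c["description"].lower() or q in c["subtitle"].lower()
    ]


def frame_inspect(db, timestamp, question=None):
    """Fetch the frames covering a specific moment in the video.

    A VQA model would answer `question` about the frames; it is
    stubbed out as None in this sketch.
    """
    for c in db["clips"]:
        if c["start"] <= timestamp < c["start"] + 5:
            return {"frames": c["frames"], "answer": None}
    return None
```

Exposing retrieval as discrete tools like this is what lets an LLM orchestrator decide, step by step, which level of the index to consult.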

The LLM serves as the system’s orchestrator, running repeated observe-reason-act cycles based on gathered information. This design gives the agent autonomy, ensures that its answers stay grounded in actual video content, and allows the system to break complex questions into smaller, more manageable sub-questions.
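The observe-reason-act cycle can be sketched as a simple loop. This is an assumption-laden illustration, not DVD's code: `reason` stands in for the orchestrating LLM, which, given the question and the evidence gathered so far, either requests a tool call or commits to an answer grounded in its observations.

```python
def run_agent(question, tools, reason, max_steps=10):
    """Run repeated observe-reason-act cycles until an answer emerges.

    `tools` maps tool names to callables; `reason` is a placeholder for
    the LLM orchestrator and returns either ("call", tool_name, arg)
    to gather more evidence or ("answer", text) to finish.
    """
    observations = []                               # evidence gathered so far
    for _ in range(max_steps):
        decision = reason(question, observations)   # reason over the evidence
        if decision[0] == "answer":
            return decision[1]                      # answer grounded in observations
        _, tool_name, arg = decision                # act: pick a tool
        observations.append(tools[tool_name](arg))  # observe the result
    return None  # step budget exhausted without a confident answer
```

Because the loop feeds every tool result back into the next reasoning step, the agent can decompose a complex question into sub-questions, answering each with a targeted retrieval before composing the final response.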
DVD achieves state-of-the-art performance across benchmarks
DVD achieved state-of-the-art performance across multiple long‑video benchmarks (Table 1). On the challenging LVBench dataset, DVD reached 74.2% accuracy, outperforming all existing methods and improving on the previous best method, MR. Video, by 13.4 points. When transcript data was available, accuracy rose to 76.0%.

DVD also exceeded previous state-of-the-art performance on three other long‑video benchmarks: LongVideoBench, Video MME Long, and EgoSchema, surpassing human‑level accuracy (approximately 76%) on EgoSchema.
The choice of reasoning model critically affects DVD’s performance (Figure 2). Replacing the reasoning model with OpenAI o4‑mini or GPT‑4o causes sharp performance drops, indicating that limited reasoning capability breaks down the agent’s process. Different models also show distinct patterns in how they use tools, how deeply they analyze videos, and how accurately they respond. For example, GPT‑4o often exhibits “overconfidence,” stopping its analysis prematurely. These observations offer practical guidance for designing future agents and developing foundational LLMs.

Toward more comprehensive video understanding
As video content becomes richer and more complex, enabling AI to interpret and reason about what it captures, not just identify individual elements, is a central challenge in video comprehension. DVD offers one path forward through an agentic approach that is interpretable, capable of planning, and collaborative.
Looking forward, researchers at Microsoft Research Asia are working to develop agents with stronger contextual awareness and more advanced reasoning capabilities, advancing toward AI systems that can handle complex videos with greater depth, precision, and automation.