{"id":1159209,"date":"2025-12-19T00:01:03","date_gmt":"2025-12-19T08:01:03","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-blog-post&#038;p=1159209"},"modified":"2025-12-19T01:16:48","modified_gmt":"2025-12-19T09:16:48","slug":"deep-video-discovery-using-agentic-search-to-analyze-long-form-video","status":"publish","type":"msr-blog-post","link":"https:\/\/www.microsoft.com\/en-us\/research\/articles\/deep-video-discovery-using-agentic-search-to-analyze-long-form-video\/","title":{"rendered":"Deep Video Discovery: Using agentic search to analyze long-form video"},"content":{"rendered":"\n<p>Extracting useful information from long videos, whether meeting recordings, experimental data, or lecture content, requires painstaking manual review. AI tools offer some help: language-vision models can summarize short clips or answer questions when videos are divided into clear scenes or chapters. But for hours\u2011long recordings packed with information and lacking obvious structure, current models are limited. They process videos slowly, are unable to connect information across long stretches of content, and often provide limited or unhelpful answers.<\/p>\n\n\n\n<p>To address these limitations, researchers at Microsoft Research Asia developed <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/deep-video-discovery-agentic-search-with-tool-use-for-long-form-video-understanding\/\">Deep Video Discovery (DVD)<\/a>, an agentic AI framework for long-video analysis. DVD divides long videos into shorter clips for individual analysis, then uses LLM-based reasoning to plan next steps and select appropriate tools. The agent retrieves needed information and uses it to answer complex questions about the video.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"how-dvd-works\">How DVD works<\/h2>\n\n\n\n<p>DVD operates through a simple cycle: observe the video content, analyze what it means, and choose the next action. 
Current video-analysis systems follow rigid, predesigned steps that have difficulty adapting to different tasks. In contrast, DVD adjusts its approach based on information it has gathered so far. To support this flexibility, the system operates in two stages:<\/p>\n\n\n\n<p><strong>Stage 1: Building a searchable video database<\/strong><\/p>\n\n\n\n<p>The system converts long videos into a structured database, dividing them into five-second clips and extracting information at three levels:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Global<\/strong>: Provides a topic-level summary of the video.<\/li>\n\n\n\n<li><strong>Clip\u2011level<\/strong>: Includes subtitles and brief text descriptions of each segment.<\/li>\n\n\n\n<li><strong>Frame\u2011level<\/strong>: Includes individual frames and visual details captured moment by moment.<\/li>\n<\/ul>\n\n\n\n<p><strong>Stage 2: Retrieving information and generating answers<\/strong><\/p>\n\n\n\n<p>The system uses three core tools to search the database:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Global browse<\/strong>: Provides high\u2011level context and video summaries.<\/li>\n\n\n\n<li><strong>Clip search<\/strong>: Retrieves clips that match a description and returns relevant results with subtitles and timestamps.<\/li>\n\n\n\n<li><strong>Frame inspect<\/strong>: Examines a specific moment in the video and extracts fine visual details; it can also answer questions about what appears in that frame.<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/deep-video-discovery-2-1024x390.png\" alt=\"DVD operates in two stages\"\/><figcaption class=\"wp-element-caption\">Figure 1. DVD operates in two stages. First, it converts long videos into a searchable database organized by clips and frames at multiple scales. 
Then, it answers user queries through autonomous search and tool use.<\/figcaption><\/figure>\n\n\n\n<p>The LLM serves as the system\u2019s orchestrator, running repeated observe-reason-act cycles based on gathered information. This design gives the agent autonomy, ensures that its answers stay grounded in actual video content, and allows the system to break complex questions into smaller, more manageable sub-questions.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"dvd-achieves-state-of-the-art-performance-across-benchmarks\">DVD achieves state-of-the-art performance across benchmarks<\/h2>\n\n\n\n<p>DVD achieved state-of-the-art performance across multiple long\u2011video benchmarks (Table 1). On the challenging LVBench dataset, DVD reached 74.2% accuracy, outperforming all existing methods, a 13.4\u2011point gain over the previous best method, MR. Video. When transcript data was available, accuracy rose to 76.0%.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><img decoding=\"async\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/deep-video-discovery-3-1024x789.png\" alt=\"DVD outperforms prior work by a large margin on LVBench\" style=\"aspect-ratio:1.2978526847074199;width:713px;height:auto\"\/><figcaption class=\"wp-element-caption\">Table 1. DVD outperforms prior work by a large margin on LVBench.<\/figcaption><\/figure>\n\n\n\n<p>DVD also exceeded previous state-of-the-art performance on three other long\u2011video benchmarks: LongVideoBench, Video MME Long, and EgoSchema, surpassing human\u2011level accuracy (approximately 76%) on EgoSchema.<\/p>\n\n\n\n<p>The choice of reasoning model critically affects DVD\u2019s performance (Figure 2). Replacing the reasoning model with OpenAI o4\u2011mini or GPT\u20114o causes sharp performance drops, indicating that limited reasoning capability breaks down the agent\u2019s process. 
Different models also show distinct patterns in how they use tools, how deeply they analyze videos, and how accurately they respond. For example, GPT\u20114o often exhibits \u201coverconfidence,\u201d stopping its analysis prematurely. These observations offer practical guidance for designing future agents and developing foundational LLMs.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/deep-video-discovery-4-1024x451.png\" alt=\"Analysis of how different foundation models behave inside the agent. The results show clear differences across models.\"\/><figcaption class=\"wp-element-caption\">Figure 2. Analysis of how different foundation models behave inside the agent. The results show clear differences across models.<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"toward-more-comprehensive-video-understanding\">Toward more comprehensive video understanding<\/h2>\n\n\n\n<p>As video content becomes richer and more complex, enabling AI to interpret and reason about what it captures, not just identify individual elements, is a central challenge in video comprehension. DVD offers one path forward through an agentic approach that is interpretable, plannable, and collaborative.<\/p>\n\n\n\n<p>Looking forward, researchers at Microsoft Research Asia are working to develop agents with stronger contextual awareness and more advanced reasoning capabilities, advancing toward AI systems that can handle complex videos with greater depth, precision, and automation.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Extracting useful information from long videos, whether meeting recordings, experimental data, or lecture content, requires painstaking manual review. AI tools offer some help: language-vision models can summarize short clips or answer questions when videos are divided into clear scenes or chapters. 
But for hours\u2011long recordings packed with information and lacking obvious structure, current models are [&hellip;]<\/p>\n","protected":false},"author":34512,"featured_media":1145256,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-content-parent":199560,"msr_hide_image_in_river":null,"footnotes":""},"research-area":[13556],"msr-locale":[268875],"msr-post-option":[269148,269142],"class_list":["post-1159209","msr-blog-post","type-msr-blog-post","status-publish","has-post-thumbnail","hentry","msr-research-area-artificial-intelligence","msr-locale-en_us","msr-post-option-approved-for-river","msr-post-option-include-in-river"],"msr_assoc_parent":{"id":199560,"type":"lab"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/1159209","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-blog-post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/34512"}],"version-history":[{"count":8,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/1159209\/revisions"}],"predecessor-version":[{"id":1159218,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/1159209\/revisions\/1159218"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/1145256"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=1159209"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=1159209"},{"taxonomy":"msr-locale","
embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=1159209"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=1159209"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}