{"id":1147987,"date":"2025-08-14T20:00:46","date_gmt":"2025-08-15T03:00:46","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-blog-post&#038;p=1147987"},"modified":"2025-08-14T20:00:48","modified_gmt":"2025-08-15T03:00:48","slug":"streammind-ai-system-that-responds-to-video-in-real-time","status":"publish","type":"msr-blog-post","link":"https:\/\/www.microsoft.com\/en-us\/research\/articles\/streammind-ai-system-that-responds-to-video-in-real-time\/","title":{"rendered":"StreamMind: AI system that responds to video in real time"},"content":{"rendered":"\n<p>Imagine a pair of smart glasses that detects its surroundings and speaks up at critical moments, such as when a car is approaching. That kind of split-second assistance could be transformative for people with low vision, but today\u2019s visual AI assistants often miss those moments.<\/p>\n\n\n\n<p>The problem isn&#8217;t that the technology can&#8217;t detect its environment. It&#8217;s that current AI systems get bogged down trying to analyze every single frame of video, dozens per second, slowing themselves down in the process. By the time they recognize what\u2019s happening, the moment for helpful intervention has passed.<\/p>\n\n\n\n<p>Now, researchers from Microsoft Research Asia and Nanjing University have designed a system aimed at overcoming this limitation. Their model, called <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/streammind-unlocking-full-frame-rate-streaming-video-dialogue-through-event-gated-cognition\/\">StreamMind<\/a>, processes video more like a human brain, skimming over uneventful moments and focusing only when something important occurs. The result is video processing that\u2019s up to ten times faster, quick enough to respond as events unfold.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"a-brain-inspired-approach\">A brain-inspired approach<\/h2>\n\n\n\n<p>The key insight is surprisingly simple: instead of analyzing every frame, StreamMind uses an event-gated network that separates fast perception from deeper analysis (Figure 1).<\/p>\n\n\n\n<p>A lightweight system continuously scans video for changes. Only when something meaningful occurs, like a car entering a crosswalk, does it trigger a more powerful large language model (LLM). This decoupling lets the perception module run at video speed, while the cognition module, the LLM, activates only when needed. By removing unneeded computation, StreamMind can keep pace with the video stream, maintaining real-time awareness of its environment.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"427\" height=\"163\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/streammind-1.png\" alt=\"diagram\" class=\"wp-image-1136423\" style=\"width:532px;height:auto\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/streammind-1.png 427w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/streammind-1-300x115.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/streammind-1-240x92.png 240w\" sizes=\"auto, (max-width: 427px) 100vw, 427px\" \/><figcaption class=\"wp-element-caption\">Figure 1. 
### Demonstrations: StreamMind in action

In demonstrations, StreamMind provided responses that matched the timing of events, while current methods lagged. It kept pace with a soccer match, providing smooth play-by-play commentary, and guided a cook through a recipe step by step.

*Video 1. Navigation assistance: StreamMind responds as events occur, while comparison methods react noticeably later.*

*Video 2. Sports commentary: In a live soccer match, StreamMind keeps up with the flow of play and delivers timely narration.*

*Video 3. Cooking guidance: In a kitchen setting, the model provides step-by-step instructions, keeping pace with the action.*

## How the technology works

StreamMind combines two key innovations to enable real-time video perception and response:

**Smart memory system**

The Event Perception Feature Extractor (EPFE) addresses the biggest bottleneck in current video AI models: how to handle incoming frames in real time without getting overwhelmed. It uses a state-space model (a method for tracking how data streams such as video, audio, or sensor inputs change over time) to extract patterns from long, continuous input. This allows the EPFE to remember key events using just one compact piece of information, called a perception token, and lets the system keep pace with the video stream efficiently.
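As a rough illustration of why a state-space recurrence suits streaming input, the sketch below folds an arbitrarily long sequence of frame features into a single fixed-size state at constant cost per frame. The matrices, dimensions, and decay factor are invented for illustration; the real EPFE is a learned model, not this toy linear recurrence.

```python
# Toy linear state-space recurrence: an arbitrarily long frame stream
# is compressed into one fixed-size state with an O(1) update per frame.
# All matrices and sizes are invented for illustration (not the EPFE).
import numpy as np

rng = np.random.default_rng(0)
d_state, d_frame = 16, 64

A = 0.95 * np.eye(d_state)                    # state transition (slow decay)
B = rng.normal(0.0, 0.1, (d_state, d_frame))  # input projection

state = np.zeros(d_state)
for _ in range(1000):                         # 1,000 frames, constant memory
    frame_features = rng.normal(size=d_frame)
    state = A @ state + B @ frame_features    # same-cost update every frame

perception_token = state                      # one compact summary of the stream
print(perception_token.shape)                 # -> (16,)
```

The practical payoff is that memory and per-frame compute stay constant no matter how long the stream runs, which is what lets the perception side keep up with the frame rate.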
**Intelligent decision making**

The second component determines whether what is occurring in the video is relevant to the user's request and whether the assistant should respond. This is a challenge because there is often no direct connection between a user's request and individual video frames. For example, a request like "help me fix my bike" requires understanding when to jump in with assistance.

To make those judgments, StreamMind draws on knowledge from an LLM to recognize when events are relevant and a response is needed. A small gating network, combined with a compact one-token summary of the video input, allows StreamMind to monitor events in real time and autonomously call on the LLM when it is time to act.

*Figure 2. StreamMind architecture. The EPFE (blue) continuously extracts video features. The gating network (labeled "Cognition Gate" in red) decides whether to invoke the large model.*
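The post describes the gate only at a high level, so the following sketch is an assumed design for intuition: a small binary classifier over the current perception token and an embedding of the user's request. The class name, dimensions, and architecture are hypothetical.

```python
# Assumed shape of a small cognition gate: a binary classifier over the
# current perception token plus a query embedding (hypothetical design).
import torch
import torch.nn as nn

class CognitionGate(nn.Module):
    def __init__(self, d_token=16, d_query=32, d_hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_token + d_query, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, 1),              # logit: respond now?
        )

    def forward(self, token, query_emb):
        return torch.sigmoid(self.net(torch.cat([token, query_emb], dim=-1)))

gate = CognitionGate()
token = torch.randn(16)      # compact summary from the perception module
query_emb = torch.randn(32)  # embedding of e.g. "help me fix my bike"
if gate(token, query_emb) > 0.5:
    print("Gate open: invoke the LLM")
```

Because the gate sees only one token plus the query embedding, evaluating it per frame is cheap enough to run at full video rate, which is the design requirement the post describes.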
## Testing shows major speed gains

When evaluated against existing methods, StreamMind's processing speed surpassed all other systems at every tested video speed. Even for fast 100-fps gaming video streams, it kept up with every frame in real time, something no previous system could manage (Figure 3).

*Figure 3. Frames per second (FPS): the time it took StreamMind and two popular video models to process one second of streaming video at different speeds (A100 GPU). StreamMind (the third bar, in orange) achieves 100-fps processing speed.*

The researchers tested StreamMind in a range of scenarios, including online video commentary, predicting what would happen next in a video, and recognizing complex tasks like changing a tire or cooking. They used large datasets such as Ego4D (3,670 hours of first-person video from 923 participants across 74 locations), SoccerNet (videos of 12 European soccer matches), and COIN (11,827 instructional videos across 12 different subjects). The following tables show the detailed results of these tests.

*Table 1. Results from the Ego4D and SoccerNet experiments*

*Table 2. Results from the Ego4D LTA dataset experiments*

*Table 3. Results from the COIN dataset experiments*

Across all tests comparing StreamMind's timing alignment and language-modeling capabilities to those of existing streaming dialogue models, StreamMind delivered the best results, demonstrating that it can handle complex, fast-changing, real-world scenarios.
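To put the 100-fps result above in perspective: real-time processing at 100 frames per second leaves a budget of only 1/100 s, or 10 ms, per frame. The generic timing harness below (not the researchers' benchmark code; the 2 ms workload is a placeholder) shows how such a budget can be checked:

```python
# Generic per-frame latency check (illustrative harness, not the paper's
# benchmark). Real time at 100 fps means a budget of 10 ms per frame.
import time

TARGET_FPS = 100
BUDGET_S = 1.0 / TARGET_FPS          # 0.01 s, i.e., 10 ms per frame

def placeholder_perception_step():
    time.sleep(0.002)                # stand-in for ~2 ms of real work

latencies = []
for _ in range(200):
    start = time.perf_counter()
    placeholder_perception_step()
    latencies.append(time.perf_counter() - start)

worst = max(latencies)
verdict = "within" if worst <= BUDGET_S else "over"
print(f"worst frame: {worst * 1e3:.2f} ms ({verdict} the 10 ms budget)")
```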
## From lab to real life

StreamMind's event-driven design could make wearable AI far more responsive. By focusing on the moments that matter rather than on every frame, smart glasses and similar devices could react to important events as they happen rather than after the fact, able to guide, warn, and assist in step with real-world events.