{"id":1153693,"date":"2025-11-12T04:00:20","date_gmt":"2025-11-12T12:00:20","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=1153693"},"modified":"2025-11-12T10:17:41","modified_gmt":"2025-11-12T18:17:41","slug":"mmctagent-enabling-multimodal-reasoning-over-large-video-and-image-collections","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/mmctagent-enabling-multimodal-reasoning-over-large-video-and-image-collections\/","title":{"rendered":"MMCTAgent: Enabling multimodal reasoning over large video and image collections"},"content":{"rendered":"\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1400\" height=\"788\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/10\/MMCTAgent-BlogHeroFeature-1400x788-1.jpg\" alt=\"Three white icons on a blue-to-purple gradient background: the first icon shows an image\/photo; the second icon depicts a computer monitor with vertical bars; the third icon displays three connected circles with user silhouettes.\" class=\"wp-image-1153930\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/10\/MMCTAgent-BlogHeroFeature-1400x788-1.jpg 1400w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/10\/MMCTAgent-BlogHeroFeature-1400x788-1-300x169.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/10\/MMCTAgent-BlogHeroFeature-1400x788-1-1024x576.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/10\/MMCTAgent-BlogHeroFeature-1400x788-1-768x432.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/10\/MMCTAgent-BlogHeroFeature-1400x788-1-1066x600.jpg 1066w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/10\/MMCTAgent-BlogHeroFeature-1400x788-1-655x368.jpg 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/10\/MMCTAgent-BlogHeroFeature-1400x788-1-240x135.jpg 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/10\/MMCTAgent-BlogHeroFeature-1400x788-1-640x360.jpg 640w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/10\/MMCTAgent-BlogHeroFeature-1400x788-1-960x540.jpg 960w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/10\/MMCTAgent-BlogHeroFeature-1400x788-1-1280x720.jpg 1280w\" sizes=\"auto, (max-width: 1400px) 100vw, 1400px\" \/><\/figure>\n\n\n\n<p>Modern multimodal AI models can recognize objects, describe scenes, and answer questions about images and short video clips, but they struggle with long-form and large-scale visual data, where real-world reasoning requires moving beyond object recognition and short-clip analysis.<\/p>\n\n\n\n<p>Real-world reasoning increasingly involves analyzing long-form video content, where context spans minutes or hours, far beyond the context limits of most models.\u202fIt also entails querying across massive multimodal libraries of videos, images, and transcripts, where finding and integrating relevant evidence requires more than retrieval\u2014it requires strategic reasoning. Existing models typically perform single-pass inference, producing one-shot answers. 
This limits their ability to handle tasks that require temporal reasoning, cross-modal grounding, and iterative refinement.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"mmctagent\">MMCTAgent<\/h2>\n\n\n\n<p>To meet these challenges, we developed the <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/mmctagent-multi-modal-critical-thinking-agent-framework-for-complex-visual-reasoning\/?msockid=153992cb7df169482b9487167c0968e9\" target=\"_blank\" rel=\"noreferrer noopener\">Multi-modal Critical Thinking Agent<\/a>, or MMCTAgent, for structured reasoning over long-form video and image data, available on <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/github.com\/microsoft\/MMCTAgent\" target=\"_blank\" rel=\"noopener noreferrer\">GitHub<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and featured on <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/labs.ai.azure.com\/projects\/mmct-agent\/\">Azure AI Foundry Labs<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>.<\/p>\n\n\n\n<p>Built on <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/project\/autogen\" target=\"_blank\" rel=\"noreferrer noopener\">AutoGen<\/a>, Microsoft\u2019s open-source multi-agent system, MMCTAgent provides multimodal question-answering with a Planner\u2013Critic architecture. This design enables planning, reflection, and tool-based reasoning, bridging perception and deliberation in multimodal tasks. It links language, vision, and temporal understanding, transforming static multimodal tasks into dynamic reasoning workflows.<\/p>\n\n\n\n<p>Unlike conventional models that produce one-shot answers, MMCTAgent uses modality-specific agents, ImageAgent and VideoAgent, each equipped with tools such as get_relevant_query_frames() and object_detection_tool(). These agents perform deliberate, iterative reasoning\u2014selecting the right tools for each modality, evaluating intermediate results, and refining conclusions through a Critic loop.
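<\/p>\n\n\n\n<p>To make this flow concrete, the sketch below outlines what such a Planner\u2013Critic loop might look like in plain Python. It is a minimal illustration rather than the MMCTAgent implementation: the call_llm placeholder, the prompts, and the tool-call convention are assumptions made for this example.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># A minimal, illustrative Planner-Critic loop in plain Python.\n# Everything here (call_llm, the prompts, max_rounds) is a stand-in,\n# not the actual MMCTAgent or AutoGen interface.\nfrom dataclasses import dataclass, field\n\n@dataclass\nclass ReasoningState:\n    question: str\n    evidence: list = field(default_factory=list)   # tool outputs gathered so far\n    draft: str = ''\n    critique: str = ''\n\ndef call_llm(prompt: str) -> str:\n    # Placeholder for a chat-completion call (e.g., an Azure OpenAI deployment).\n    return 'ACCEPT'\n\ndef planner_step(state: ReasoningState, tools: dict) -> None:\n    # Ask the model which tool to run, call it, then draft an answer from the evidence.\n    choice = call_llm('Question: ' + state.question + ' Tools: ' + ', '.join(tools)\n                      + ' Critique so far: ' + state.critique\n                      + ' Reply with: tool_name | argument')\n    if '|' in choice:\n        name, arg = [part.strip() for part in choice.split('|', 1)]\n        if name in tools:\n            state.evidence.append(tools[name](arg))   # e.g., object_detection_tool(...)\n    state.draft = call_llm('Answer the question using only this evidence: ' + repr(state.evidence))\n\ndef critic_step(state: ReasoningState) -> bool:\n    # Check the draft against the evidence; accept it or record a critique.\n    verdict = call_llm('Evidence: ' + repr(state.evidence) + ' Draft: ' + state.draft\n                       + ' Is the draft grounded and consistent? Reply ACCEPT or critique.')\n    state.critique = verdict\n    return verdict.startswith('ACCEPT')\n\ndef answer(question: str, tools: dict, max_rounds: int = 3) -> str:\n    state = ReasoningState(question)\n    for _ in range(max_rounds):   # iterative refinement: plan, act, reflect\n        planner_step(state, tools)\n        if critic_step(state):\n            break   # the Critic accepted the grounded draft\n    return state.draft<\/code><\/pre>\n\n\n\n<p>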
This enables MMCTAgent to analyze complex queries across long videos and large image libraries with explainability, extensibility, and scalability.<\/p>\n\n\n\n<div class=\"wp-block-buttons is-content-justification-center is-content-justification-center is-layout-flex wp-container-core-buttons-is-layout-16018d1d wp-block-buttons-is-layout-flex\">\n<div class=\"wp-block-button\"><a data-bi-type=\"button\" class=\"wp-block-button__link wp-element-button\" href=\"https:\/\/labs.ai.azure.com\/projects\/mmct-agent\/\">MMCTAgent on Azure AI Foundry Labs<\/a><\/div>\n<\/div>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"how-mmctagent-works\">How MMCTAgent works<\/h2>\n\n\n\n<p>MMCTAgent integrates two coordinated agents, Planner and Critic, orchestrated through AutoGen. The Planner agent decomposes a user query, identifies the appropriate reasoning tools, performs multimodal operations, and drafts a preliminary answer. The Critic agent reviews the Planner\u2019s reasoning chain, validates evidence alignment, and refines or revises the response for factual accuracy and consistency.<\/p>\n\n\n\n<p>This iterative reasoning loop enables MMCTAgent to improve its answers through structured self-evaluation\u2014bringing reflection into AI reasoning. A key strength of MMCTAgent lies in its modular extensibility.
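<\/p>\n\n\n\n<p>For instance, plugging a new capability into the framework might look roughly like the sketch below, in which a plain dictionary stands in for the tool collections described next; the registration helper and the medical-image analyzer are hypothetical examples, not part of the released code.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># Illustrative only: a plain dict stands in for the ImageQnATools collection,\n# and register_tool and analyze_medical_image are hypothetical examples.\nfrom typing import Callable, Dict\n\nImageQnATools: Dict[str, Callable[[str], str]] = {}\n\ndef register_tool(name: str, fn: Callable[[str], str]) -> None:\n    # Make a new capability discoverable by the Planner agent.\n    ImageQnATools[name] = fn\n\ndef analyze_medical_image(image_path: str) -> str:\n    # A domain-specific model would run here and return text the Planner\n    # can reason over alongside the built-in tools.\n    return 'no acute abnormality seen in ' + image_path + ' (stub output)'\n\nregister_tool('medical_image_analyzer', analyze_medical_image)\nprint(sorted(ImageQnATools))   # the Planner now sees the new tool<\/code><\/pre>\n\n\n\n<p>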
Developers can easily integrate new, domain-specific tools\u2014such as medical image analyzers, industrial inspection models, or specialized retrieval modules\u2014by adding them to ImageQnATools or VideoQnATools. This design makes MMCTAgent adaptable across domains.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"videoagent-from-ingestion-to-long-form-multimodal-reasoning\">VideoAgent: From ingestion to long-form multimodal reasoning<\/h3>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"14353\" height=\"8455\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/11\/MMCT_UPDATED_FINAL_FINAL.png\" alt=\"MMCTAgent\u2019s Planner\u2013Critic architecture enables multimodal reasoning over long-form video through structured ingestion, retrieval, and iterative feedback.\u00a0\" class=\"wp-image-1155366\"\/><figcaption class=\"wp-element-caption\">Figure 1. MMCTAgent\u2019s Planner\u2013Critic architecture enables multimodal reasoning over long-form video through structured ingestion, retrieval, and iterative feedback<\/figcaption><\/figure>\n\n\n\n<p>The VideoAgent extends this architecture to long-form video reasoning. It operates in two connected phases: library creation (ingestion) and query-time reasoning.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"phase-1-video-ingestion-and-library-creation\">Phase 1 \u2013 Video ingestion and library creation<\/h4>\n\n\n\n<p>Before reasoning, long-form videos undergo an ingestion pipeline that aligns multimodal information for retrieval and understanding:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Transcription <\/strong>and<strong> translation<\/strong>: Converts audio to text and, if multilingual, translates transcripts into a consistent language&nbsp;<\/li>\n\n\n\n<li><strong>Key-frame identification<\/strong>: Extracts representative frames marking major visual or scene changes<\/li>\n\n\n\n<li><strong>Semantic chunking <\/strong>and<strong> chapter generation<\/strong>: Combines transcript segments and visual summaries into coherent, semantically segmented chapters with associated key frames. 
Inspired by Microsoft\u2019s <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/deep-video-discovery-agentic-search-with-tool-use-for-long-form-video-understanding\/\" target=\"_blank\" rel=\"noreferrer noopener\">Deep Video Discovery agentic search tool<\/a>, this step also extracts detailed descriptions of objects, on-screen text, and characters present within each video segment, integrating these insights directly into the corresponding chapters.<\/li>\n\n\n\n<li><strong>Multimodal embedding creation<\/strong>: Generates image embeddings for key frames, linking them to their corresponding transcript and chapter data<\/li>\n<\/ol>\n\n\n\n<p>All structured metadata, including transcripts, visual summaries, chapters, and embeddings, is indexed in the Multimodal Knowledgebase using <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/search\/search-what-is-azure-search\" target=\"_blank\" rel=\"noopener noreferrer\">Azure AI Search<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, which forms the foundation for scalable semantic retrieval and downstream reasoning.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"phase-2-video-question-answering-and-reasoning\">Phase 2 \u2013 Video question answering and reasoning<\/h4>\n\n\n\n<p>When a user submits a query, the VideoAgent retrieves, analyzes, and reasons across the indexed video content using specialized Planner and Critic tools.<\/p>\n\n\n\n<h5 class=\"wp-block-heading\" id=\"planner-tools-1\">Planner tools<\/h5>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>get_video_analysis<\/strong>: Finds the most relevant video, provides a summary, and lists detected objects<\/li>\n\n\n\n<li><strong>get_context<\/strong>: Retrieves contextual information and relevant chapters from the Azure AI Search index<\/li>\n\n\n\n<li><strong>get_relevant_frames<\/strong>: Selects key frames most relevant to the user query<\/li>\n\n\n\n<li><strong>query_frame<\/strong>: Performs detailed visual and textual reasoning over selected frames<\/li>\n<\/ul>\n\n\n\n<p><strong>get_context<\/strong> and <strong>get_relevant_frames<\/strong> work in tandem to ensure that reasoning begins from the most semantically relevant evidence.<\/p>\n\n\n\n<h5 class=\"wp-block-heading\" id=\"critic-tools-1\">Critic tool<\/h5>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>critic_tool<\/strong>: Evaluates the reasoning output for temporal alignment, factual accuracy, and coherence between visual and textual modalities<\/li>\n<\/ul>\n\n\n\n<p>This two-phase design, which involves structured ingestion followed by agentic reasoning, enables MMCTAgent to deliver accurate, interpretable insights for long, information-dense videos.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"imageagent-structured-reasoning-for-static-visuals\">ImageAgent: Structured reasoning for static visuals<\/h3>\n\n\n\n<p>While the VideoAgent handles temporal reasoning across long-form videos, the ImageAgent applies the same Planner\u2013Critic paradigm to static visual analysis.
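<\/p>\n\n\n\n<p>As a rough illustration, the sketch below stubs out a few of the Planner tools listed later in this section and routes a query to them with simple keyword rules; in MMCTAgent the Planner makes these choices with an LLM, and the Critic then reviews the draft answer. The tool outputs and routing logic shown here are assumptions for the example.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># A rough sketch of ImageAgent-style tool selection. The keyword routing and\n# stubbed tools are simplifications of the Planner tools described below.\ndef vit_tool(image_path: str) -> str:\n    return 'a street scene with two parked cars (stub caption)'\n\ndef object_detection_tool(image_path: str) -> str:\n    return 'detected: car x2, person x1 (stub detections)'\n\ndef ocr_tool(image_path: str) -> str:\n    return 'text found: NO PARKING 8AM-6PM (stub OCR)'\n\ndef image_agent(query: str, image_path: str) -> str:\n    evidence = [vit_tool(image_path)]   # always start from a high-level caption\n    lowered = query.lower()\n    if any(word in lowered for word in ('how many', 'count', 'where')):\n        evidence.append(object_detection_tool(image_path))\n    if any(word in lowered for word in ('say', 'text', 'sign', 'read')):\n        evidence.append(ocr_tool(image_path))\n    return 'Q: ' + query + ' | evidence: ' + '; '.join(evidence)\n\nprint(image_agent('What does the sign say?', 'street.jpg'))<\/code><\/pre>\n\n\n\n<p>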
The ImageAgent performs modular, tool-based reasoning over images, combining perception tools for recognition, detection, and optical character recognition with language-based reasoning for interpretation and explanation.<\/p>\n\n\n\n<h5 class=\"wp-block-heading\" id=\"planner-tools\">Planner tools<\/h5>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>vit_tool<\/strong>: Leverages a Vision Transformer (ViT) or Vision Language Model (VLM) for high-level visual understanding and description<\/li>\n\n\n\n<li><strong>recog_tool<\/strong>: Performs scene, face, and object recognition<\/li>\n\n\n\n<li><strong>object_detection_tool<\/strong>: Localizes and labels entities within an image<\/li>\n\n\n\n<li><strong>ocr_tool<\/strong>: Extracts embedded text from visual elements<\/li>\n<\/ul>\n\n\n\n<h5 class=\"wp-block-heading\" id=\"critic-tool\">Critic tool<\/h5>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>critic_tool<\/strong>: Validates the Planner\u2019s conclusions for factual alignment and consistency, refining the final response<\/li>\n<\/ul>\n\n\n\n<p>This lightweight ImageAgent provides fine-grained, explainable reasoning over image collections\u2014supporting visual question answering, content inspection, and multimodal retrieval\u2014while maintaining architectural symmetry with the VideoAgent.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"evaluation-results\">Evaluation results<\/h2>\n\n\n\n<p>To assess the effectiveness of MMCTAgent, we evaluated both the ImageAgent and VideoAgent with multiple base LLMs across a range of benchmark datasets and real-world scenarios. Key results are presented below.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th>Image dataset (accuracy %)<\/th><th>GPT-4V<\/th><th>MMCT with GPT-4V<\/th><th>GPT-4o<\/th><th>MMCT with GPT-4o<\/th><th>GPT-5<\/th><th>MMCT with GPT-5<\/th><\/tr><\/thead><tbody><tr><td>MM-Vet [1]<\/td><td>60.20<\/td><td>74.24<\/td><td>77.98<\/td><td>79.36<\/td><td>80.51<\/td><td>81.65<\/td><\/tr><tr><td>MMMU [2]<\/td><td>56.80<\/td><td>63.57<\/td><td>69.10<\/td><td>73.00<\/td><td>84.20<\/td><td>85.44<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th>Video dataset (accuracy %)<\/th><th>GPT-4o<\/th><th>MMCT with GPT-4o<\/th><\/tr><\/thead><tbody><tr><td>Video-MME [3]<\/td><td>72.10<\/td><td>76.70<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>MMCTAgent enhances base model performance by augmenting each model\u2019s capabilities with appropriate tools, such as object detection and optical character recognition (OCR) for weaker models or domain-specific tools for stronger models, leading to substantial improvements. For example, integrating these tools raised GPT-4V\u2019s accuracy from 60.20% to 74.24% on the MM-Vet benchmark. The configurable Critic agent provides a further layer of validation, which is especially valuable in critical domains. Additional evaluation results are available <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/github.com\/microsoft\/MMCTAgent\/blob\/main\/EVALUATIONS.md\" target=\"_blank\" rel=\"noopener noreferrer\">here<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"takeaways-and-next-steps\">Takeaways and next steps<\/h2>\n\n\n\n<p>MMCTAgent demonstrates a scalable agentic approach to multimodal reasoning with a Planner\u2013Critic architecture.
Its unified multimodal design supports both image and video pipelines, while the extensible toolchain enables rapid integration of domain-specific tools and capabilities. It provides Azure-native deployment and supports configurability within the broader open-source ecosystem.<\/p>\n\n\n\n<p>Looking ahead, we aim to improve the efficiency and adaptability of its retrieval and reasoning workflows, and to extend MMCTAgent beyond its current agricultural evaluations into new real-world domains through initiatives like <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/project\/project-gecko\" target=\"_blank\" rel=\"noreferrer noopener\">Project Gecko<\/a>, advancing accessible, innovative multimodal applications for people around the globe.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"acknowledgements\">Acknowledgements<\/h2>\n\n\n\n<p>We would like to thank our team members for their valuable contributions to this work: Aman Patkar, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/ogbemie\" target=\"_blank\" rel=\"noreferrer noopener\">Ogbemi Ekwejunor-Etchie<\/a>, Somnath Kumar, Soumya De, and Yash Gadhia.<\/p>\n\n\n\n<p><strong>References<\/strong><\/p>\n\n\n\n<p>[1] W. Yu, Z. Yang, L. Li, J. Wang, K. Lin, Z. Liu, X. Wang, and L. Wang. \u201cMM-Vet: Evaluating large multimodal models for integrated capabilities\u201d, 2023.<\/p>\n\n\n\n<p>[2] X. Yue, Y. Ni, K. Zhang, T. Zheng, R. Liu, G. Zhang, S. Stevens, D. Jiang, W. Ren, Y. Sun, C. Wei, B. Yu, R. Yuan, R. Sun, M. Yin, B. Zheng, Z. Yang, Y. Liu, W. Huang, H. Sun, Y. Su, and W. Chen. \u201cMMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI\u201d, 2023.<\/p>\n\n\n\n<p>[3] Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. \u201cVideo-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis\u201d, 2024.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>MMCTAgent enables dynamic multimodal reasoning with iterative planning and reflection. 
Built on Microsoft\u2019s AutoGen framework, it integrates language, vision, and temporal understanding for complex tasks like long video and image analysis.<\/p>\n","protected":false},"author":43868,"featured_media":1153930,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":[{"type":"user_nicename","value":"Akshay Nambi","user_id":"38169"},{"type":"user_nicename","value":"Kavyansh Chourasia","user_id":"43029"},{"type":"user_nicename","value":"Tanuja Ganu","user_id":"38883"}],"msr_hide_image_in_river":null,"footnotes":""},"categories":[1],"tags":[],"research-area":[13556],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[269148,243984,269142,269145],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-1153693","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-research-blog","msr-research-area-artificial-intelligence","msr-locale-en_us","msr-post-option-approved-for-river","msr-post-option-blog-homepage-featured","msr-post-option-include-in-river","msr-post-option-pinned-for-river"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[199562],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[],"related-projects":[1119384],"related-events":[],"related-researchers":[{"type":"user_nicename","value":"Akshay Nambi","user_id":38169,"display_name":"Akshay Nambi","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/akshayn\/\" aria-label=\"Visit the profile page for Akshay Nambi\">Akshay Nambi<\/a>","is_active":false,"last_first":"Nambi, Akshay","people_section":0,"alias":"akshayn"},{"type":"user_nicename","value":"Kavyansh Chourasia","user_id":43029,"display_name":"Kavyansh Chourasia","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/kchourasia\/\" aria-label=\"Visit the profile page for Kavyansh Chourasia\">Kavyansh Chourasia<\/a>","is_active":false,"last_first":"Chourasia, Kavyansh","people_section":0,"alias":"kchourasia"},{"type":"user_nicename","value":"Tanuja Ganu","user_id":38883,"display_name":"Tanuja Ganu","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/taganu\/\" aria-label=\"Visit the profile page for Tanuja Ganu\">Tanuja Ganu<\/a>","is_active":false,"last_first":"Ganu, Tanuja","people_section":0,"alias":"taganu"}],"msr_type":"Post","featured_image_thumbnail":"<img width=\"960\" height=\"540\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/10\/MMCTAgent-BlogHeroFeature-1400x788-1-960x540.jpg\" class=\"img-object-cover\" alt=\"Three white icons on a blue-to-purple gradient background: the first icon shows an image\/photo; the second icon depicts a computer monitor with vertical bars; the third icon displays three connected circles with user silhouettes.\" decoding=\"async\" loading=\"lazy\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/10\/MMCTAgent-BlogHeroFeature-1400x788-1-960x540.jpg 960w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/10\/MMCTAgent-BlogHeroFeature-1400x788-1-300x169.jpg 300w, 
https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/10\/MMCTAgent-BlogHeroFeature-1400x788-1-1024x576.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/10\/MMCTAgent-BlogHeroFeature-1400x788-1-768x432.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/10\/MMCTAgent-BlogHeroFeature-1400x788-1-1066x600.jpg 1066w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/10\/MMCTAgent-BlogHeroFeature-1400x788-1-655x368.jpg 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/10\/MMCTAgent-BlogHeroFeature-1400x788-1-240x135.jpg 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/10\/MMCTAgent-BlogHeroFeature-1400x788-1-640x360.jpg 640w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/10\/MMCTAgent-BlogHeroFeature-1400x788-1-1280x720.jpg 1280w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/10\/MMCTAgent-BlogHeroFeature-1400x788-1.jpg 1400w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/>","byline":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/akshayn\/\" title=\"Go to researcher profile for Akshay Nambi\" aria-label=\"Go to researcher profile for Akshay Nambi\" data-bi-type=\"byline author\" data-bi-cN=\"Akshay Nambi\">Akshay Nambi<\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/kchourasia\/\" title=\"Go to researcher profile for Kavyansh Chourasia\" aria-label=\"Go to researcher profile for Kavyansh Chourasia\" data-bi-type=\"byline author\" data-bi-cN=\"Kavyansh Chourasia\">Kavyansh Chourasia<\/a>, and <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/taganu\/\" title=\"Go to researcher profile for Tanuja Ganu\" aria-label=\"Go to researcher profile for Tanuja Ganu\" data-bi-type=\"byline author\" data-bi-cN=\"Tanuja Ganu\">Tanuja Ganu<\/a>","formattedDate":"November 12, 2025","formattedExcerpt":"MMCTAgent enables dynamic multimodal reasoning with iterative planning and reflection. 
Built on Microsoft\u2019s AutoGen framework, it integrates language, vision, and temporal understanding for complex tasks like long video and image analysis.","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/1153693","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/43868"}],"replies":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/comments?post=1153693"}],"version-history":[{"count":71,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/1153693\/revisions"}],"predecessor-version":[{"id":1155562,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/1153693\/revisions\/1155562"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/1153930"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=1153693"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/categories?post=1153693"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/tags?post=1153693"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=1153693"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=1153693"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=1153693"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=1153693"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=1153693"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=1153693"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=1153693"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=1153693"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}