

MMCTAgent: Enabling multimodal reasoning over large video and image collections

Modern multimodal AI models can recognize objects, describe scenes, and answer questions about images and short video clips, but they struggle with long-form and large-scale visual data, where real-world reasoning demands more than recognizing objects and analyzing short clips. Because these models typically perform single-pass inference and produce one-shot answers, they fall short on tasks that require temporal reasoning, cross-modal grounding, and iterative refinement.

To meet these challenges, we developed the Multi-modal Critical Thinking Agent, or MMCTAgent, for structured reasoning over long-form video and image data, available on GitHub and featured on Azure AI Foundry Labs.

Built on AutoGen, Microsoft’s open-source multi-agent framework, MMCTAgent provides multimodal question-answering with a Planner–Critic architecture. This design combines planning, reflection, and tool-based reasoning, bridging perception and deliberation: it links language, vision, and temporal understanding, turning static multimodal tasks into dynamic reasoning workflows.
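
To make the Planner–Critic interplay concrete, the sketch below outlines the control loop in plain Python. It is illustrative only: the Planner, Critic, Plan, and answer_query names are hypothetical stand-ins rather than MMCTAgent's actual classes, and in the real system these roles are LLM-backed AutoGen agents.

```python
from dataclasses import dataclass

# Illustrative sketch of a Planner-Critic loop. The class and function
# names here are hypothetical stand-ins, not MMCTAgent's actual API;
# in MMCTAgent these roles are LLM-backed AutoGen agents.

@dataclass
class Plan:
    steps: list        # ordered tool calls proposed by the Planner
    answer: str = ""   # draft answer after the steps are executed

class Planner:
    """Hypothetical Planner: proposes tool calls and drafts an answer."""
    def propose(self, query: str, feedback: str = "") -> Plan:
        steps = ["get_relevant_query_frames", "object_detection_tool"]
        # A real Planner would execute these tools and ground the draft
        # in their outputs; this stub returns a placeholder answer.
        return Plan(steps=steps, answer=f"Draft answer to: {query!r}")

class Critic:
    """Hypothetical Critic: accepts the draft or requests refinement."""
    def review(self, query: str, plan: Plan) -> tuple:
        grounded = bool(plan.steps)  # toy acceptance check
        feedback = "" if grounded else "Ground the answer in frame evidence."
        return grounded, feedback

def answer_query(query: str, planner: Planner, critic: Critic,
                 max_rounds: int = 3) -> str:
    """Plan, critique, and refine until the Critic accepts or rounds run out."""
    feedback = ""
    plan = Plan(steps=[])
    for _ in range(max_rounds):
        plan = planner.propose(query, feedback)
        accepted, feedback = critic.review(query, plan)
        if accepted:
            break
    return plan.answer

if __name__ == "__main__":
    print(answer_query("What happens after the first goal?", Planner(), Critic()))
```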

Unlike conventional models that produce one-shot answers, MMCTAgent relies on modality-specific agents, such as ImageAgent and VideoAgent, each equipped with tools like get_relevant_query_frames() or object_detection_tool(). These agents perform deliberate, iterative reasoning: selecting the right tools for each modality, evaluating intermediate results, and refining conclusions through a Critic loop. This enables MMCTAgent to analyze complex queries across long videos and large image libraries with explainability, extensibility, and scalability.
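
As a further illustration, the stub below shows how a modality-specific agent might expose its tools for the Planner to select among. Only the tool names come from the post; their signatures, return values, and the run_video_agent helper are assumptions made for this sketch, not MMCTAgent's implementation.

```python
from typing import Callable, Dict, List

# Stubbed video tools; the names follow the post, but the signatures and
# return types are assumptions made for this sketch.

def get_relevant_query_frames(video_path: str, query: str) -> List[str]:
    """Retrieve identifiers of the frames most relevant to the query (stub)."""
    return [f"{video_path}#frame_{i}" for i in (12, 340, 905)]

def object_detection_tool(frame_id: str) -> List[dict]:
    """Detect objects in a single frame (stub)."""
    return [{"label": "person", "confidence": 0.92}]

# A modality-specific agent can be viewed as a bundle of such tools that
# the Planner selects among, with results passed back to the Critic.
VIDEO_AGENT_TOOLS: Dict[str, Callable] = {
    "get_relevant_query_frames": get_relevant_query_frames,
    "object_detection_tool": object_detection_tool,
}

def run_video_agent(video_path: str, query: str) -> dict:
    """One hypothetical VideoAgent pass: retrieve frames, then detect objects in them."""
    frames = VIDEO_AGENT_TOOLS["get_relevant_query_frames"](video_path, query)
    detections = {f: VIDEO_AGENT_TOOLS["object_detection_tool"](f) for f in frames}
    return {"query": query, "frames": frames, "detections": detections}

if __name__ == "__main__":
    print(run_video_agent("match.mp4", "Who scores the first goal?"))
```

In the full system, the Critic would inspect intermediate outputs like these and ask the Planner for another round when the evidence is insufficient, rather than accepting the first pass.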