{"id":1160129,"date":"2026-01-20T09:00:00","date_gmt":"2026-01-20T17:00:00","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=1160129"},"modified":"2026-03-18T18:07:06","modified_gmt":"2026-03-19T01:07:06","slug":"multimodal-reinforcement-learning-with-agentic-verifier-for-ai-agents","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/multimodal-reinforcement-learning-with-agentic-verifier-for-ai-agents\/","title":{"rendered":"Argos: Multimodal reinforcement learning with agentic verifier for AI agents"},"content":{"rendered":"\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1400\" height=\"788\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/01\/Argos-BlogHeroFeature-1400x788-1.jpg\" alt=\"Diagram showing visual, audio, and document icons feeding into a central network icon of connected people, which then leads to a checkmark symbol, all on a blue\u2011to\u2011purple gradient background.\" class=\"wp-image-1160195\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/01\/Argos-BlogHeroFeature-1400x788-1.jpg 1400w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/01\/Argos-BlogHeroFeature-1400x788-1-300x169.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/01\/Argos-BlogHeroFeature-1400x788-1-1024x576.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/01\/Argos-BlogHeroFeature-1400x788-1-768x432.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/01\/Argos-BlogHeroFeature-1400x788-1-1066x600.jpg 1066w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/01\/Argos-BlogHeroFeature-1400x788-1-655x368.jpg 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/01\/Argos-BlogHeroFeature-1400x788-1-240x135.jpg 240w, 
https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/01\/Argos-BlogHeroFeature-1400x788-1-640x360.jpg 640w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/01\/Argos-BlogHeroFeature-1400x788-1-960x540.jpg 960w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/01\/Argos-BlogHeroFeature-1400x788-1-1280x720.jpg 1280w\" sizes=\"auto, (max-width: 1400px) 100vw, 1400px\" \/><\/figure>\n\n\n\n<div style=\"padding-bottom:0; padding-top:0\" class=\"wp-block-msr-immersive-section alignfull row wp-block-msr-immersive-section\">\n\t\n\t<div class=\"container\">\n\t\t<div class=\"wp-block-msr-immersive-section__inner wp-block-msr-immersive-section__inner--narrow\">\n\t\t\t<div class=\"wp-block-columns mb-10 pb-1 pr-1 is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\" style=\"box-shadow:var(--wp--preset--shadow--outlined)\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<h2 class=\"wp-block-heading h3\" id=\"at-a-glance\">At a glance<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Today&#8217;s multimodal AI systems can give answers that sound right but may not be grounded in what they actually observe over time, leading to unpredictable errors and safety risks in real-world settings.<\/li>\n\n\n\n<li>Argos is a verification framework for multimodal reinforcement learning that trains models by rewarding not just correct answers, but correct answers grounded in visual and temporal evidence, using automated verification rather than human labeling. It selects the appropriate specialized tools for each answer based on what needs to be verified.<\/li>\n\n\n\n<li>Models trained with Argos show stronger spatial reasoning, far fewer visual hallucinations, more stable learning dynamics, and better performance on robotics and real-world tasks while requiring fewer training 
samples.<\/li>\n<\/ul>\n<\/div>\n<\/div>\t\t<\/div>\n\t<\/div>\n\n\t<\/div>\n\n\n\n<p>Over the past few years, AI systems have become much better at discerning images, generating language, and performing tasks within physical and virtual environments. Yet they still fail in ways that are hard to predict and even harder to fix. A robot might try to grasp a tool when the object is visibly blocked, or a visual assistant integrated into smart glasses might describe objects that aren\u2019t actually present.<\/p>\n\n\n\n<p>These errors often arise because today\u2019s multimodal agents are trained to generate outputs that are plausible rather than grounded in the actual information they receive from their environment. As a result, a model\u2019s output can seem correct while relying on incorrect information. As AI systems are increasingly used to navigate 3D spaces and make decisions in real-world settings, this gap can be a safety and reliability concern.<\/p>\n\n\n\n<p>To tackle this challenge, we posed the question: How can we train AI agents to generate correct answers and take appropriate actions for the right reasons so that their behavior is reliable even as the environment or tasks change?<\/p>\n\n\n\n<p><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/multimodal-reinforcement-learning-with-agentic-verifier-for-ai-agents\/\">Argos<\/a> represents a novel answer to this challenge. It\u2019s an agentic verification framework designed to improve the reliability of reinforcement learning in multimodal models.&nbsp;Reinforcement learning is a training method where AI models learn by receiving rewards for desired behaviors and penalties for undesired ones, gradually improving their performance through trial and error.<\/p>\n\n\n\n<p>Rather than rewarding only correct behaviors, Argos evaluates <em>how<\/em> those behaviors were produced. 
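To make "rewarding the how" concrete, a verifier-gated reward can be sketched in a few lines. This is a minimal illustration only: the function name, the score inputs, and the equal weights are assumptions made for this example, not Argos's actual formula.

```python
def gated_reward(answer_correct: bool,
                 grounding_score: float,
                 reasoning_score: float) -> float:
    """Toy verifier-gated reward: grounding and reasoning quality
    contribute only when the final answer is correct, so a model
    cannot earn reward for plausible reasoning about a wrong answer."""
    accuracy = 1.0 if answer_correct else 0.0
    # Gate the auxiliary checks on answer correctness.
    return accuracy * (1.0 + 0.5 * grounding_score + 0.5 * reasoning_score)

# A correct, well-grounded, well-reasoned answer earns the full reward...
print(gated_reward(True, 1.0, 1.0))   # 2.0
# ...while a wrong answer earns nothing, however confident its reasoning.
print(gated_reward(False, 1.0, 1.0))  # 0.0
```

The gating is the key design choice: because the auxiliary scores are multiplied by the accuracy term rather than simply summed, noisy grounding or reasoning feedback cannot dominate the reward when the answer itself is wrong.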
It draws on a pool of larger, more capable teacher models and rule-based checks to verify two things: first, that the objects and events a model references actually exist in its input, and second, that the model\u2019s reasoning aligns with what it observes. Argos rewards the model when both conditions are met. In practice, these rewards help curate high-quality training data and guide the model\u2019s further training.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"how-argos-works\">How Argos works<\/h2>\n\n\n\n<p>Argos functions as a verification layer on top of an existing multimodal model. Given an image or video, a task or query, and information about the model\u2019s reasoning and output, Argos identifies where the model indicates objects are located in the image, when it indicates events occur in a video, and what action or answer it produces.<\/p>\n\n\n\n<p>Argos then applies specialized tools tailored to the specific content to evaluate and score three aspects of the model\u2019s output. It checks whether the answer is correct, whether referenced objects and events appear at the indicated locations and times, and whether the reasoning is consistent with the visual evidence and the answer (Figure 1).<\/p>\n\n\n\n<p>These scores are combined using a gated aggregation function, a method that dynamically adjusts the importance of different scores. It emphasizes reasoning checks only when the final output is correct. This design prevents unreliable feedback from dominating training and produces a stable reward signal for reinforcement learning.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"2560\" height=\"1255\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/01\/argos_agentic_verifier-scaled.jpg\" alt=\"Figure 1 shows an overview of Argos, an agentic verifier for multimodal reinforcement learning and its downstream applications. 
The left half of the figure illustrates Argos verifying model responses to visual questions. The left example counts dogs in an image, with red dots marking the referenced dogs and a visual grounding score; another example shows a bathroom scene where the agent reasons whether it can open the door, with an accuracy score. Below these, a blue bar titled \u201cArgos verifier\u201d feeds into icons representing multiple tools, including Grounding DINO, SAM-2, a pointing-hand evaluator, string matching, and a language model score, where their outputs combine into grounding and accuracy scores. The right half of the figure depicts three categories of downstream tasks powered by this supervision: robotic manipulation (a robot arm interacting with objects on a table), high-level task planning and completion (placing toilet paper on the back of a toilet and putting a bowl on a coffee table), and spatial reasoning (answering a viewpoint-based navigation question using room images). The overall message is that dense, grounded verification enables stronger agent performance on complex, real-world tasks.\" class=\"wp-image-1160145\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/01\/argos_agentic_verifier-scaled.jpg 2560w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/01\/argos_agentic_verifier-300x147.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/01\/argos_agentic_verifier-1024x502.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/01\/argos_agentic_verifier-768x376.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/01\/argos_agentic_verifier-1536x753.jpg 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/01\/argos_agentic_verifier-2048x1004.jpg 2048w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/01\/argos_agentic_verifier-240x118.jpg 240w\" sizes=\"auto, 
(max-width: 2560px) 100vw, 2560px\" \/><figcaption class=\"wp-element-caption\">Figure 1. Argos selects different specialized tools to verify and score the accuracy of referenced points and events in the agent\u2019s reasoning.<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"using-argos-to-curate-data-for-supervised-fine-tuning\">Using Argos to curate data for supervised fine-tuning<\/h2>\n\n\n\n<p>Argos also helps curate high-quality training data to provide the model with a strong foundation in grounded reasoning. Before the reinforcement learning stage begins, Argos uses a multi-stage process to generate data that is explicitly tied to visual locations and time intervals.<\/p>\n\n\n\n<p>In the first stage, Argos identifies the objects, actions, and events that are relevant to a task and links them to specific locations in images or specific moments in videos. These references are overlaid on images and selected video frames. Next, a reasoning model generates step-by-step explanations that refer to these visual locations and time spans.<\/p>\n\n\n\n<p>Finally, Argos evaluates each generated example for accuracy and visual grounding, filtering out low-quality training data and retaining only data that is both correct and well-grounded in visual input. The resulting dataset is then used in an initial training phase, where the model learns to generate reasoning steps before producing its final output. This process is illustrated in Figure 2.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"450\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/01\/data-curation-animation-gif.gif\" alt=\"Figure 2 illustrates the Argos scoring pipeline for both images and videos. 
On the left, two examples show an image of a living room and a short video clip, each paired with a question and a free-form model response (e.g., estimating the distance between two lamps, or describing why a person failed to pour oil). In the middle, an \u201cAgentic Verifier\u201d column parses each response into structured elements: spatial 2D points indicating the referenced object and pixel coordinates, temporal segments for the relevant video frames, a reasoning-quality panel that combines the image\/video, question, and response, and a final-answer panel comparing the predicted answer to ground truth. Below, a row of teacher models and scoring functions, such as Grounding DINO, SAM-2, a pointing-hand metric, string matching, relative accuracy, and a language model score, take these extracted elements as input to produce separate scores. On the right, arrows labeled \u201cAction\u201d and \u201cScore\u201d show how the verifier adaptively selects which teachers to call and then aggregates their outputs via a gated aggregation function into a single reward signal for training. \" class=\"wp-image-1160147\"\/><figcaption class=\"wp-element-caption\">Figure 2. Argos generates step-by-step reasoning grounded in image locations and video timestamps, then filters out low-quality training data.<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"evaluation\">Evaluation<\/h2>\n\n\n\n<p>Building on this foundation in grounded reasoning, we further trained the model using reinforcement learning guided by Argos and evaluated its performance across a range of benchmarks. On spatial reasoning tasks, the Argos-trained model outperformed both the base model Qwen2.5-VL-7B and the stronger Video-R1 baseline across challenging 3D scenarios and multi-view tasks. 
Models trained with Argos also showed a substantial reduction of hallucinations compared with both standard chain-of-thought prompting and reinforcement learning baselines.<\/p>\n\n\n\n<p>Finally, we evaluated the model in robotics and other real-world task settings, focusing on high-level planning and fine-grained control. Models trained with Argos performed better on complex, multi-step tasks. Notably, these improvements were achieved using fewer training samples than existing approaches, highlighting the importance of reward design in producing more capable and data-efficient agents. Figure 3 illustrates some of these findings.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1331\" height=\"406\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/01\/argos-blog_fig3.png\" alt=\"Figure 3 shows two side-by-side line charts comparing an Agentic model (dashed line) that uses the Argos verifier with a Non-Agentic model (solid line) trained only with an outcome reward. The left plot, \u201cResponse Accuracy,\u201d tracks response accuracy versus RL step (0, 5, 10, 15). Both models start near 0.54 accuracy, but the Agentic curve slightly rises and then stays roughly flat, while the Non-Agentic curve steadily declines to about 0.50. The right plot, \u201cVisual Grounding Acc,\u201d shows visual grounding accuracy over the same steps: the Agentic curve increases monotonically from about 0.39 to just above 0.5, whereas the Non-Agentic curve initially rises slightly and then drops sharply to about 0.1. 
Together, the plots illustrate that Argos stabilizes answer accuracy and significantly improves visual grounding, while the non-agentic model\u2019s performance and grounding collapse over training.\" class=\"wp-image-1160369\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/01\/argos-blog_fig3.png 1331w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/01\/argos-blog_fig3-300x92.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/01\/argos-blog_fig3-1024x312.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/01\/argos-blog_fig3-768x234.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/01\/argos-blog_fig3-240x73.png 240w\" sizes=\"auto, (max-width: 1331px) 100vw, 1331px\" \/><figcaption class=\"wp-element-caption\">Figure 3. Performance of Argos compared with baseline models on the task of visual hallucination detection (left) and embodied task planning and completion (right).<\/figcaption><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"how-argos-shapes-reinforcement-learning\">How Argos shapes reinforcement learning<\/h3>\n\n\n\n<p>To understand how Argos affects learning, we took the same vision-language model that had been trained on our curated dataset and fine-tuned it using reinforcement learning in two different ways. In one approach, Argos served as an agentic verifier, checking the correctness of outputs and the quality of reasoning. In the other, the model received feedback only on whether its answers were correct.<\/p>\n\n\n\n<p>We evaluated both versions on 1,500 samples from a new dataset and tracked their performance throughout the learning process (Figure 4). Although they started at similar levels, the model without Argos quickly got worse. 
Its accuracy steadily declined, and it increasingly gave answers that ignored what was in the videos. It learned to game the system by producing answers that seemed correct without grounding them in visual evidence.<\/p>\n\n\n\n<p>The model trained with Argos showed the opposite pattern. Accuracy improved steadily, and the model became better at linking its reasoning to what appeared in the videos. This difference highlights the value of verification: when training rewards both correct outputs and sound reasoning based on visual and temporal evidence, models learn to be more reliable rather than simply finding shortcuts to high scores.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1996\" height=\"550\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/01\/evaluation-fig4-final.jpg\" alt=\"Figure 4 shows two side-by-side line charts comparing an Agentic model (dashed line) that uses the Argos verifier with a Non-Agentic model (solid line) trained only with an outcome reward. The left plot, \u201cResponse Accuracy,\u201d tracks response accuracy versus RL step (0, 5, 10, 15). Both models start near 0.54 accuracy, but the Agentic curve slightly rises and then stays roughly flat, while the Non-Agentic curve steadily declines to about 0.50. The right plot, \u201cVisual Grounding Acc,\u201d shows visual grounding accuracy over the same steps: the Agentic curve increases monotonically from about 0.39 to just above 0.5, whereas the Non-Agentic curve initially rises slightly and then drops sharply to about 0.1. Together, the plots illustrate that Argos stabilizes answer accuracy and significantly improves visual grounding, while the non-agentic model\u2019s performance and grounding collapse over training. 
\" class=\"wp-image-1160428\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/01\/evaluation-fig4-final.jpg 1996w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/01\/evaluation-fig4-final-300x83.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/01\/evaluation-fig4-final-1024x282.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/01\/evaluation-fig4-final-768x212.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/01\/evaluation-fig4-final-1536x423.jpg 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/01\/evaluation-fig4-final-240x66.jpg 240w\" sizes=\"auto, (max-width: 1996px) 100vw, 1996px\" \/><figcaption class=\"wp-element-caption\">Figure 4.&nbsp;Comparison of&nbsp;response accuracy changes with and without Argos&nbsp;across&nbsp;two model versions&nbsp;(left) and&nbsp;differences in&nbsp;visual grounding accuracy&nbsp;over training for both&nbsp;versions&nbsp;(right).<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"potential-impact-and-looking-forward\">Potential impact and looking forward<\/h2>\n\n\n\n<p>This research points toward a different way of building AI agents for real-world applications. Rather than fixing errors after they occur, it focuses on training agents to systematically anchor their reasoning in what they actually receive as input throughout the training process.<\/p>\n\n\n\n<p>The potential applications span many domains. A visual assistant for a self-driving car that verifies what\u2019s actually in an image is less likely to report phantom obstacles. 
A system that automates digital tasks and checks each action against what\u2019s displayed on the screen is less likely to click the wrong button.<\/p>\n\n\n\n<p>As AI systems move beyond research labs into homes, factories, and offices, reliable reasoning becomes essential for safety and trust. Argos represents an early example of verification systems that evolve alongside the AI models they supervise. Future verifiers could be tailored for specific fields like medical imaging, industrial simulations, and business analytics. As more advanced models and richer data sources become available, researchers can use them to improve these verification systems, providing even better guidance during training and further reducing hallucinations.<\/p>\n\n\n\n<p>We hope that this research helps move the field toward AI systems that are both capable and interpretable: agents that can explain their decisions, point to the evidence behind them, and be trained to adhere to real-world requirements and values.<\/p>\n\n\n\n<figure class=\"wp-block-video aligncenter\"><video height=\"1076\" style=\"aspect-ratio: 1920 \/ 1076;\" width=\"1920\" controls poster=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/01\/argos.png\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/01\/argos-demo-video.mp4\"><\/video><\/figure>\n","protected":false},"excerpt":{"rendered":"<p>Argos improves multimodal RL by evaluating whether an agent\u2019s reasoning aligns with what it observes over time. 
The approach reduces visual hallucinations and produces more reliable, data-efficient agents for real-world applications.<\/p>\n","protected":false},"author":43518,"featured_media":1160195,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":[{"type":"user_nicename","value":"Reuben Tan","user_id":"43827"},{"type":"user_nicename","value":"Baolin Peng","user_id":"43779"},{"type":"user_nicename","value":"Zhengyuan Yang","user_id":"44024"},{"type":"user_nicename","value":"Oier Mees","user_id":"44070"},{"type":"user_nicename","value":"Jianfeng Gao","user_id":"32246"}],"msr_hide_image_in_river":null,"footnotes":""},"categories":[1],"tags":[],"research-area":[13556],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[269148,243984,269142],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-1160129","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-research-blog","msr-research-area-artificial-intelligence","msr-locale-en_us","msr-post-option-approved-for-river","msr-post-option-blog-homepage-featured","msr-post-option-include-in-river"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[199565],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[],"related-projects":[],"related-events":[],"related-researchers":[{"type":"user_nicename","value":"Reuben Tan","user_id":43827,"display_name":"Reuben Tan","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/tanreuben\/\" aria-label=\"Visit the profile page for Reuben Tan\">Reuben Tan<\/a>","is_active":false,"last_first":"Tan, 
Reuben","people_section":0,"alias":"tanreuben"},{"type":"user_nicename","value":"Baolin Peng","user_id":43779,"display_name":"Baolin Peng","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/baolinpeng\/\" aria-label=\"Visit the profile page for Baolin Peng\">Baolin Peng<\/a>","is_active":false,"last_first":"Peng, Baolin","people_section":0,"alias":"baolinpeng"},{"type":"user_nicename","value":"Zhengyuan Yang","user_id":44024,"display_name":"Zhengyuan Yang","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/zhengyang\/\" aria-label=\"Visit the profile page for Zhengyuan Yang\">Zhengyuan Yang<\/a>","is_active":false,"last_first":"Yang, Zhengyuan","people_section":0,"alias":"zhengyang"},{"type":"user_nicename","value":"Oier Mees","user_id":44070,"display_name":"Oier Mees","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/oiermees\/\" aria-label=\"Visit the profile page for Oier Mees\">Oier Mees<\/a>","is_active":false,"last_first":"Mees, Oier","people_section":0,"alias":"oiermees"},{"type":"user_nicename","value":"Jianfeng Gao","user_id":32246,"display_name":"Jianfeng Gao","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/jfgao\/\" aria-label=\"Visit the profile page for Jianfeng Gao\">Jianfeng Gao<\/a>","is_active":false,"last_first":"Gao, Jianfeng","people_section":0,"alias":"jfgao"}],"msr_type":"Post","featured_image_thumbnail":"<img width=\"960\" height=\"540\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/01\/Argos-BlogHeroFeature-1400x788-1-960x540.jpg\" class=\"img-object-cover\" alt=\"Diagram showing visual, audio, and document icons feeding into a central network icon of connected people, which then leads to a checkmark symbol, all on a blue\u2011to\u2011purple gradient background.\" decoding=\"async\" loading=\"lazy\" 
srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/01\/Argos-BlogHeroFeature-1400x788-1-960x540.jpg 960w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/01\/Argos-BlogHeroFeature-1400x788-1-300x169.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/01\/Argos-BlogHeroFeature-1400x788-1-1024x576.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/01\/Argos-BlogHeroFeature-1400x788-1-768x432.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/01\/Argos-BlogHeroFeature-1400x788-1-1066x600.jpg 1066w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/01\/Argos-BlogHeroFeature-1400x788-1-655x368.jpg 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/01\/Argos-BlogHeroFeature-1400x788-1-240x135.jpg 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/01\/Argos-BlogHeroFeature-1400x788-1-640x360.jpg 640w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/01\/Argos-BlogHeroFeature-1400x788-1-1280x720.jpg 1280w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/01\/Argos-BlogHeroFeature-1400x788-1.jpg 1400w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/>","byline":"","formattedDate":"January 20, 2026","formattedExcerpt":"Argos improves multimodal RL by evaluating whether an agent\u2019s reasoning aligns with what it observes over time. 
The approach reduces visual hallucinations and produces more reliable, data-efficient agents for real-world applications.","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/1160129","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/43518"}],"replies":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/comments?post=1160129"}],"version-history":[{"count":59,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/1160129\/revisions"}],"predecessor-version":[{"id":1166401,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/1160129\/revisions\/1166401"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/1160195"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=1160129"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/categories?post=1160129"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/tags?post=1160129"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=1160129"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=1160129"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=11
60129"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=1160129"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=1160129"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=1160129"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=1160129"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=1160129"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}