{"id":1163539,"date":"2026-03-12T09:38:45","date_gmt":"2026-03-12T16:38:45","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=1163539"},"modified":"2026-04-22T11:48:41","modified_gmt":"2026-04-22T18:48:41","slug":"systematic-debugging-for-ai-agents-introducing-the-agentrx-framework","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/systematic-debugging-for-ai-agents-introducing-the-agentrx-framework\/","title":{"rendered":"Systematic debugging for AI agents: Introducing the AgentRx framework"},"content":{"rendered":"\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1400\" height=\"788\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/AgentRx-BlogHeroFeature-1400x788-1.jpg\" alt=\"Three white line icons, showing network, workflow, and bug\u2011analysis icons, on a blue\u2011to\u2011purple gradient background.\" class=\"wp-image-1163547\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/AgentRx-BlogHeroFeature-1400x788-1.jpg 1400w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/AgentRx-BlogHeroFeature-1400x788-1-300x169.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/AgentRx-BlogHeroFeature-1400x788-1-1024x576.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/AgentRx-BlogHeroFeature-1400x788-1-768x432.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/AgentRx-BlogHeroFeature-1400x788-1-1066x600.jpg 1066w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/AgentRx-BlogHeroFeature-1400x788-1-655x368.jpg 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/AgentRx-BlogHeroFeature-1400x788-1-240x135.jpg 240w, 
https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/AgentRx-BlogHeroFeature-1400x788-1-640x360.jpg 640w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/AgentRx-BlogHeroFeature-1400x788-1-960x540.jpg 960w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/AgentRx-BlogHeroFeature-1400x788-1-1280x720.jpg 1280w\" sizes=\"auto, (max-width: 1400px) 100vw, 1400px\" \/><\/figure>\n\n\n\n<div style=\"padding-bottom:0; padding-top:0\" class=\"wp-block-msr-immersive-section alignfull row wp-block-msr-immersive-section\">\n\t\n\t<div class=\"container\">\n\t\t<div class=\"wp-block-msr-immersive-section__inner wp-block-msr-immersive-section__inner--narrow\">\n\t\t\t<div class=\"wp-block-columns mb-10 pb-1 pr-1 is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\" style=\"box-shadow:var(--wp--preset--shadow--outlined)\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<h2 class=\"wp-block-heading h3\" id=\"at-a-glance\">At a glance<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> Debugging AI agent failures is hard because trajectories are long, stochastic, and often multi-agent, so the true root cause gets buried.<\/li>\n\n\n\n<li><strong>Solution:<\/strong> <strong><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/aka.ms\/AgentRx\/Repo\" type=\"link\" id=\"https:\/\/aka.ms\/AgentRx\/Repo\" target=\"_blank\" rel=\"noopener noreferrer\">AgentRx<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/strong> pinpoints the <em>first unrecoverable (\u201ccritical failure\u201d) step<\/em> by synthesizing <strong>guarded, executable constraints<\/strong> from tool schemas and domain policies, then logging evidence-backed violations step-by-step.<\/li>\n\n\n\n<li><strong>Benchmark + taxonomy:<\/strong> We release <strong><a 
class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/aka.ms\/AgentRx\/Dataset\" type=\"link\" id=\"https:\/\/aka.ms\/AgentRx\/Dataset\" target=\"_blank\" rel=\"noopener noreferrer\">AgentRx Benchmark<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/strong> with <strong>115<\/strong> manually annotated failed trajectories across <strong>\u03c4-bench<\/strong>, <strong>Flash<\/strong>, and <strong>Magentic-One<\/strong>, plus a grounded <strong>nine-category failure taxonomy<\/strong>.<\/li>\n\n\n\n<li><strong>Results + release:<\/strong> AgentRx improves failure localization (<strong>+23.6%<\/strong>) and root-cause attribution (<strong>+22.9%<\/strong>) over prompting baselines, and we are open-sourcing the framework and dataset.<\/li>\n<\/ul>\n<\/div>\n<\/div>\t\t<\/div>\n\t<\/div>\n\n\t<\/div>\n\n\n\n<p>As AI agents transition from simple chatbots to autonomous systems capable of managing cloud incidents, navigating complex web interfaces, and executing multi-step API workflows, a new challenge has emerged: <strong>transparency.<\/strong><\/p>\n\n\n\n<p>When a human makes a mistake, we can usually trace the logic. But when an AI agent fails, perhaps by hallucinating a tool output or deviating from a security policy ten steps into a fifty-step task, identifying exactly where and why things went wrong is an arduous, manual process.<\/p>\n\n\n\n<p>Today, we are excited to announce the open-source release of <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/aka.ms\/AgentRx\/Repo\" target=\"_blank\" rel=\"noopener noreferrer\"><strong>AgentRx<\/strong><span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, an automated, domain-agnostic framework designed to pinpoint the &#8220;critical failure step&#8221; in agent trajectories. 
Alongside the framework, we are releasing the <strong><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/aka.ms\/AgentRx\/Dataset\" type=\"link\" id=\"https:\/\/aka.ms\/AgentRx\/Dataset\" target=\"_blank\" rel=\"noopener noreferrer\">AgentRx Benchmark<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/strong>, a dataset of 115 manually annotated failed trajectories to help the community build more transparent, resilient agentic systems.<\/p>\n\n\n\n<figure class=\"wp-block-video\"><video height=\"1080\" style=\"aspect-ratio: 1920 \/ 1080;\" width=\"1920\" controls src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/AgentRx-framework.mp4\"><\/video><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"the-challenge-why-ai-agents-are-hard-to-debug\">The challenge: Why AI agents are hard to debug<\/h2>\n\n\n\n<p>Modern AI agents are often:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Long-horizon:<\/strong> They perform dozens of actions over extended periods.<\/li>\n\n\n\n<li><strong>Probabilistic:<\/strong> The same input might lead to different outputs, making reproduction difficult.<\/li>\n\n\n\n<li><strong>Multi-agent:<\/strong> Failures can be &#8220;passed&#8221; between agents, masking the original root cause.<\/li>\n<\/ul>\n\n\n\n<p>Traditional success metrics (like \u201cDid the task finish?\u201d) don\u2019t tell us enough. To build safe agents, we need to identify the exact moment a trajectory becomes unrecoverable and capture evidence for what went wrong at that step.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"introducing-agentrx-an-automated-diagnostic-prescription\">Introducing AgentRx: An automated diagnostic &#8220;prescription&#8221;<\/h2>\n\n\n\n<p><strong>AgentRx<\/strong> (short for &#8220;Agent Diagnosis&#8221;) treats agent execution like a system trace that needs validation. 
Instead of relying on a single LLM to &#8220;guess&#8221; the error, AgentRx uses a structured, multi-stage pipeline:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Trajectory normalization:<\/strong> Heterogeneous logs from different domains are converted into a common intermediate representation.<\/li>\n\n\n\n<li><strong>Constraint synthesis:<\/strong> The framework automatically generates executable constraints based on tool schemas (e.g., &#8220;The API must return a valid JSON response&#8221;) and domain policies (e.g., &#8220;Do not delete data without user confirmation&#8221;).<\/li>\n\n\n\n<li><strong>Guarded evaluation:<\/strong> AgentRx evaluates constraints step-by-step, checking each constraint only when its <em>guard condition<\/em> applies, and produces an <strong>auditable validation log<\/strong> of evidence-backed violations.<\/li>\n\n\n\n<li><strong>LLM-based judging:<\/strong> Finally, an LLM judge uses the validation log and a grounded failure taxonomy to identify the <strong>Critical Failure Step<\/strong>\u2014the first unrecoverable error.<\/li>\n<\/ol>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1197\" height=\"955\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/agentrx.png\" alt=\"Flowchart illustrating an agent failure attribution pipeline. In the upper left, a blue rounded box labeled \u201cTask Context\u201d contains three stacked inputs: \u201cDomain Policy,\u201d \u201cTool Schema,\u201d and \u201cTrajectory.\u201d A downward arrow leads into a large yellow rounded rectangle representing the validation pipeline. 
Inside this area, a green box labeled \u201cConstraint Generator\u201d feeds into a blue box labeled \u201cConstraint Checker.\u201d To their right is a JSON-like constraint specification with fields such as assertion_name: \"tshirt_count_matches\", taxonomy_targets: \"MisinterpretationOfToolOutput\", type: \"RELATIONAL_POST\", trigger: { step_index: 7, agent_name: \"assistant\" }, and a check hint stating that the assistant must compute the T-shirt count from the correct field of the get_product_details result. At the far right, a scroll-shaped box labeled \u201cValidation log (Violated Constraints + Taxonomy checklist)\u201d receives output from the checker. Above the yellow region, under the title \u201cAgent Failure Attribution,\u201d a trajectory snippet shows a tool call to get_product_details followed by an assistant message highlighted in a red box: \u201cThere are 11 available T-shirt options \u2026\u201d. Red text below labels this as \u201cMisinterpretation of Tool Output @ Step 7.\u201d A dark upward arrow connects the validation log back to the highlighted assistant message, showing that the violated constraint is used to attribute the failure to that specific step. 
A horizontal arrow along the bottom asks, \u201cConstraint violated?\u201d \" class=\"wp-image-1163550\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/agentrx.png 1197w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/agentrx-300x239.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/agentrx-1024x817.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/agentrx-768x613.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/agentrx-226x180.png 226w\" sizes=\"auto, (max-width: 1197px) 100vw, 1197px\" \/><figcaption class=\"wp-element-caption\"><em>The AgentRx workflow: <\/em>Given a failed trajectory, tool schemas, and domain policy, AgentRx synthesizes guarded constraints, evaluates them step-by-step to produce an auditable violation log with evidence, and uses an LLM judge to predict the <strong>critical failure step<\/strong> and <strong>root-cause category<\/strong>.<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"a-new-benchmark-for-agent-failures\">A new benchmark for agent failures<\/h2>\n\n\n\n<p>To evaluate AgentRx, we developed a manually annotated benchmark consisting of <strong>115 failed trajectories<\/strong> across three complex domains:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>\u03c4-bench:<\/strong> Structured API workflows for retail and service tasks.<\/li>\n\n\n\n<li><strong>Flash:<\/strong> Real-world incident management and system troubleshooting.<\/li>\n\n\n\n<li><strong>Magentic-One:<\/strong> Open-ended web and file tasks using a generalist multi-agent system.<\/li>\n<\/ul>\n\n\n\n<p>Using a grounded-theory approach, we derived a <strong>nine-category failure taxonomy<\/strong> that generalizes across these domains. 
This taxonomy helps developers distinguish between a <strong>&#8220;Plan Adherence Failure&#8221;<\/strong> (where the agent ignored its own steps) and an <strong>&#8220;Invention of New Information&#8221;<\/strong> (hallucination).<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th>Taxonomy Category<\/th><th>Description<\/th><\/tr><\/thead><tbody><tr><td>Plan Adherence Failure<\/td><td>Ignored required steps \/ did extra unplanned actions<\/td><\/tr><tr><td>Invention of New Information<\/td><td>Altered facts not grounded in trace\/tool output<\/td><\/tr><tr><td>Invalid Invocation<\/td><td>Tool call malformed \/ missing args \/ schema-invalid<\/td><\/tr><tr><td>Misinterpretation of Tool Output<\/td><td>Read tool output incorrectly; acted on wrong assumptions<\/td><\/tr><tr><td>Intent\u2013Plan Misalignment<\/td><td>Misread user goal\/constraints and planned wrongly<\/td><\/tr><tr><td>Under-specified User Intent<\/td><td>Could not proceed because required info wasn\u2019t available<\/td><\/tr><tr><td>Intent Not Supported<\/td><td>No available tool can do what\u2019s being asked<\/td><\/tr><tr><td>Guardrails Triggered<\/td><td>Execution blocked by safety\/access restrictions<\/td><\/tr><tr><td>System Failure<\/td><td>Connectivity\/tool endpoint failures<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1512\" height=\"854\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/failure_timelines.png\" alt=\"Two-column taxonomy table with a dark blue header row labeled \u201cTaxonomy Category\u201d and \u201cDescription.\u201d The rows define nine agent failure types: Plan Adherence Failure, Invention of New Information, Invalid Invocation, Misinterpretation of Tool Output, Intent\u2013Plan Misalignment, Under-specified User Intent, Intent Not Supported, Guardrails Triggered, and System Failure. 
Their descriptions explain, respectively, skipped or extra actions, invented facts, malformed tool calls, incorrect reading of tool outputs, wrong planning from misunderstood intent, inability to proceed due to missing information, lack of tool support, blocking by safety or access controls, and connectivity or endpoint failures. \" class=\"wp-image-1163552\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/failure_timelines.png 1512w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/failure_timelines-300x169.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/failure_timelines-1024x578.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/failure_timelines-768x434.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/failure_timelines-240x136.png 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/failure_timelines-640x360.png 640w\" sizes=\"auto, (max-width: 1512px) 100vw, 1512px\" \/><figcaption class=\"wp-element-caption\"><em>Analysis of failure density across domains. 
In multi-agent systems like <a href=\"https:\/\/labs.ai.azure.com\/projects\/magentic-one\/\" type=\"link\" id=\"https:\/\/labs.ai.azure.com\/projects\/magentic-one\/\" target=\"_blank\" rel=\"noreferrer noopener\">Magentic-One<\/a>, trajectories often contain multiple errors, but AgentRx focuses on identifying the first critical failure.<\/em><\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"key-results\">Key results<\/h2>\n\n\n\n<p>In our experiments, AgentRx demonstrated significant improvements over existing LLM-based prompting baselines:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>+23.6% absolute improvement<\/strong> in failure localization accuracy.<\/li>\n\n\n\n<li><strong>+22.9% improvement<\/strong> in root-cause attribution.<\/li>\n<\/ul>\n\n\n\n<p>By providing the &#8220;why&#8221; behind a failure through an auditable log, AgentRx allows developers to move beyond trial-and-error prompting and toward systematic agentic engineering.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"join-the-community-open-source-release\">Join the community: Open-source release<\/h2>\n\n\n\n<p>We believe that agent reliability is a prerequisite for real-world deployment. 
To support this, we are open-sourcing the AgentRx framework and the complete annotated benchmark.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Read the Paper:<\/strong> <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/agentrx-diagnosing-ai-agent-failures-from-execution-trajectories\/\" type=\"link\" id=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/agentrx-diagnosing-ai-agent-failures-from-execution-trajectories\/\">AgentRx: Diagnosing AI Agent Failures from Execution Trajectories<\/a><\/li>\n\n\n\n<li><strong>Explore the Code & Data:<\/strong> <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/aka.ms\/AgentRx\/Code\" target=\"_blank\" rel=\"noopener noreferrer\">https:\/\/aka.ms\/AgentRx\/Code<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/li>\n<\/ul>\n\n\n\n<p>We invite researchers and developers to use AgentRx to diagnose their own agentic workflows and contribute to the growing library of failure constraints. Together, we can build AI agents that are not just powerful but also auditable and reliable.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"acknowledgements\">Acknowledgements<\/h3>\n\n\n\n<p>We would like to thank Avaljot Singh and <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/sumann\/\">Suman Nath<\/a> for contributing to this project.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>As AI agents transition from simple chatbots to autonomous systems capable of managing cloud incidents, navigating complex web interfaces, and executing multi-step API workflows, a new challenge has emerged: transparency. When a human makes a mistake, we can usually trace the logic. 
But when an AI agent fails, perhaps by hallucinating a tool output or [&hellip;]<\/p>\n","protected":false},"author":44124,"featured_media":1163547,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":[{"type":"user_nicename","value":"Shraddha Barke","user_id":"43605"},{"type":"user_nicename","value":"Arnav Goyal","user_id":"44095"},{"type":"user_nicename","value":"Alind Khare","user_id":"44098"},{"type":"user_nicename","value":"Chetan Bansal","user_id":"31394"}],"msr_hide_image_in_river":0,"footnotes":""},"categories":[1],"tags":[],"research-area":[13556],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[243984],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-1163539","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-research-blog","msr-research-area-artificial-intelligence","msr-locale-en_us","msr-post-option-blog-homepage-featured"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[199565],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[144812,793670],"related-projects":[],"related-events":[],"related-researchers":[{"type":"user_nicename","value":"Shraddha Barke","user_id":43605,"display_name":"Shraddha Barke","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/sbarke\/\" aria-label=\"Visit the profile page for Shraddha Barke\">Shraddha Barke<\/a>","is_active":false,"last_first":"Barke, Shraddha","people_section":0,"alias":"sbarke"},{"type":"user_nicename","value":"Arnav Goyal","user_id":44095,"display_name":"Arnav Goyal","author_link":"<a 
href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/t-arnavgoyal\/\" aria-label=\"Visit the profile page for Arnav Goyal\">Arnav Goyal<\/a>","is_active":false,"last_first":"Goyal, Arnav","people_section":0,"alias":"t-arnavgoyal"},{"type":"user_nicename","value":"Alind Khare","user_id":44098,"display_name":"Alind Khare","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/alindkhare\/\" aria-label=\"Visit the profile page for Alind Khare\">Alind Khare<\/a>","is_active":false,"last_first":"Khare, Alind","people_section":0,"alias":"alindkhare"},{"type":"user_nicename","value":"Chetan Bansal","user_id":31394,"display_name":"Chetan Bansal","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/chetanb\/\" aria-label=\"Visit the profile page for Chetan Bansal\">Chetan Bansal<\/a>","is_active":false,"last_first":"Bansal, Chetan","people_section":0,"alias":"chetanb"}],"msr_type":"Post","featured_image_thumbnail":"<img width=\"960\" height=\"540\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/AgentRx-BlogHeroFeature-1400x788-1-960x540.jpg\" class=\"img-object-cover\" alt=\"Three white line icons, showing network, workflow, and bug\u2011analysis icons, on a blue\u2011to\u2011purple gradient background.\" decoding=\"async\" loading=\"lazy\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/AgentRx-BlogHeroFeature-1400x788-1-960x540.jpg 960w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/AgentRx-BlogHeroFeature-1400x788-1-300x169.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/AgentRx-BlogHeroFeature-1400x788-1-1024x576.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/AgentRx-BlogHeroFeature-1400x788-1-768x432.jpg 768w, 
https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/AgentRx-BlogHeroFeature-1400x788-1-1066x600.jpg 1066w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/AgentRx-BlogHeroFeature-1400x788-1-655x368.jpg 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/AgentRx-BlogHeroFeature-1400x788-1-240x135.jpg 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/AgentRx-BlogHeroFeature-1400x788-1-640x360.jpg 640w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/AgentRx-BlogHeroFeature-1400x788-1-1280x720.jpg 1280w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/AgentRx-BlogHeroFeature-1400x788-1.jpg 1400w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/>","byline":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/sbarke\/\" title=\"Go to researcher profile for Shraddha Barke\" aria-label=\"Go to researcher profile for Shraddha Barke\" data-bi-type=\"byline author\" data-bi-cN=\"Shraddha Barke\">Shraddha Barke<\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/t-arnavgoyal\/\" title=\"Go to researcher profile for Arnav Goyal\" aria-label=\"Go to researcher profile for Arnav Goyal\" data-bi-type=\"byline author\" data-bi-cN=\"Arnav Goyal\">Arnav Goyal<\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/alindkhare\/\" title=\"Go to researcher profile for Alind Khare\" aria-label=\"Go to researcher profile for Alind Khare\" data-bi-type=\"byline author\" data-bi-cN=\"Alind Khare\">Alind Khare<\/a>, and <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/chetanb\/\" title=\"Go to researcher profile for Chetan Bansal\" aria-label=\"Go to researcher profile for Chetan Bansal\" data-bi-type=\"byline author\" data-bi-cN=\"Chetan Bansal\">Chetan Bansal<\/a>","formattedDate":"March 12, 2026","formattedExcerpt":"As AI agents transition from simple 
chatbots to autonomous systems capable of managing cloud incidents, navigating complex web interfaces, and executing multi-step API workflows, a new challenge has emerged: transparency. When a human makes a mistake, we can usually trace the logic. But when an&hellip;","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/1163539","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/44124"}],"replies":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/comments?post=1163539"}],"version-history":[{"count":13,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/1163539\/revisions"}],"predecessor-version":[{"id":1163721,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/1163539\/revisions\/1163721"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/1163547"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=1163539"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/categories?post=1163539"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/tags?post=1163539"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=1163539"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=1163
539"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=1163539"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=1163539"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=1163539"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=1163539"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=1163539"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=1163539"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}