{"id":1159844,"date":"2026-01-08T11:35:12","date_gmt":"2026-01-08T19:35:12","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-project&#038;p=1159844"},"modified":"2026-01-14T16:06:33","modified_gmt":"2026-01-15T00:06:33","slug":"agent-pex-automated-evaluation-and-testing-of-ai-agents","status":"publish","type":"msr-project","link":"https:\/\/www.microsoft.com\/en-us\/research\/project\/agent-pex-automated-evaluation-and-testing-of-ai-agents\/","title":{"rendered":"Agent-Pex: Automated Evaluation and Testing of AI Agents"},"content":{"rendered":"<section class=\"mb-3 moray-highlight\">\n\t<div class=\"card-img-overlay mx-lg-0\">\n\t\t<div class=\"card-background  has-background-catalina-blue card-background--full-bleed\">\n\t\t\t\t\t<\/div>\n\t\t<!-- Foreground -->\n\t\t<div class=\"card-foreground d-flex mt-md-n5 my-lg-5 px-g px-lg-0\">\n\t\t\t<!-- Container -->\n\t\t\t<div class=\"container d-flex mt-md-n5 my-lg-5 \">\n\t\t\t\t<!-- Card wrapper -->\n\t\t\t\t<div class=\"w-100 w-lg-col-5\">\n\t\t\t\t\t<!-- Card -->\n\t\t\t\t\t<div class=\"card material-md-card py-5 px-md-5\">\n\t\t\t\t\t\t<div class=\"card-body \">\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\n<h1 class=\"wp-block-heading\" id=\"agent-pex\">Agent-Pex<\/h1>\n\n\n\n<p>Automated evaluation and testing of AI agents<\/p>\n\n\t\t\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t<\/div>\n\t\t<\/div>\n\t<\/div>\n<\/section>\n\n\n\n\n\n<h2 class=\"wp-block-heading\" id=\"the-problem-confidence-and-reliability-in-agentic-ai\">The problem: Confidence and reliability in agentic AI<\/h2>\n\n\n\n<p>AI agents are rapidly transforming software, with projections of over a billion agents in operation by 2028. These agents, embedded in products like VS Code and M365 Copilot, perform increasingly complex tasks\u2014writing code, conducting research, and automating workflows. 
However, as agentic systems grow in complexity, understanding, debugging, and validating their behavior becomes a major challenge.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Opaque reasoning:<\/strong> Agents operate through multi-step traces, making it hard to know what they were supposed to do and whether they did it correctly.<\/li>\n\n\n\n<li><strong>Specification drift:<\/strong> Intended behaviors are often only partially specified, both explicitly (in prompts) and implicitly (in tool usage).<\/li>\n\n\n\n<li><strong>Testing at scale:<\/strong> Manual evaluation is infeasible for thousands of agent traces across diverse domains and models.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"our-solution-agent-pex\">Our solution: Agent-Pex<\/h3>\n\n\n\n<p>Agent-Pex is an AI-powered tool designed to systematically evaluate agentic traces and generate targeted agent tests. It builds on the insight that agent prompts and system instructions encode <em>checkable rules<\/em>\u2014partial specifications that can be extracted and used for automated evaluation.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Specification extraction:<\/strong> Agent-Pex parses agent prompts and traces to extract explicit and implicit behavioral rules (e.g., \u201cWhen asked for your name, respond with \u2018GitHub Copilot\u2019\u201d).<\/li>\n\n\n\n<li><strong>Automated evaluation:<\/strong> Given a trace and a specification, Agent-Pex determines if any rules were violated, providing detailed reasoning and scoring (e.g., output_spec_eval_score: 95.0).<\/li>\n\n\n\n<li><strong>Scalable analysis:<\/strong> Agent-Pex supports evaluation across thousands of traces, including those from commercial agents (VS Code) and academic benchmarks (Tau\u00b2).<\/li>\n<\/ul>\n\n\n\n<div style=\"height:20px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1057\" 
height=\"500\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/01\/Agent-Pex_scaling-diagram.png\" alt=\"Agent-Pex scaling diagram\" class=\"wp-image-1160228\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/01\/Agent-Pex_scaling-diagram.png 1057w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/01\/Agent-Pex_scaling-diagram-300x142.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/01\/Agent-Pex_scaling-diagram-1024x484.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/01\/Agent-Pex_scaling-diagram-768x363.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/01\/Agent-Pex_scaling-diagram-240x114.png 240w\" sizes=\"auto, (max-width: 1057px) 100vw, 1057px\" \/><\/figure>\n\n\n\n<p><strong>Features:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Multi-dimensional coverage: Evaluates argument validity, output compliance, plan sufficiency, and more.<\/li>\n\n\n\n<li>Model comparison: Enables side-by-side analysis of agent traces from different models (e.g., Claude-4, Gemini-Pro, o4-mini).<\/li>\n\n\n\n<li>Integration with benchmarks: Supports large-scale evaluation using Tau\u00b2, with over 5,000 traces across multiple domains.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"our-contributions\">Our contributions<\/h3>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"specification-driven-automated-and-scalable\">Specification-driven, automated, and scalable<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Specification-driven evaluation:<\/strong> Unlike traditional testing, Agent-Pex leverages the rules embedded in agent prompts and traces, enabling precise, context-aware validation.<\/li>\n\n\n\n<li><strong>Automated test generation:<\/strong> Building on prior work (PromptPex), Agent-Pex can invert 
rules to generate adversarial tests, systematically probing agent robustness.<\/li>\n\n\n\n<li><strong>Scalability:<\/strong> Handles evaluation at enterprise scale, supporting both commercial deployments and academic research.<\/li>\n<\/ul>\n\n\n\n<p><strong>PromptPex connection:<\/strong> Agent-Pex builds on our prior work, PromptPex, which is an open-source tool for automatic test generation for language model prompts. Agent-Pex extends this approach to agentic traces, using specifications to generate tests that break rules and expose weaknesses.<\/p>\n\n\n\n<p><strong>Tau\u00b2 benchmark integration:<\/strong> Agent-Pex can be used to evaluate agents on the Tau\u00b2 benchmark, which simulates realistic multi-turn conversations in domains like telecom, retail, and airline. Tau\u00b2 introduces dual-control environments, where both agent and user interact with tools, providing a rigorous testbed for agent coordination and reasoning.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"results\">Results<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Trace-level evaluation:<\/strong> Agent-Pex produces detailed reports for individual agentic traces, scoring compliance across multiple dimensions.<\/li>\n\n\n\n<li><strong>Model comparison:<\/strong> Enables quantitative comparison of agent performance across models and tasks, revealing strengths and weaknesses.<\/li>\n\n\n\n<li><strong>Benchmark-scale analysis:<\/strong> Successfully evaluated 5,000+ traces from Tau\u00b2, comparing four models across three domains, with fine-grained analysis by domain and metric.<\/li>\n\n\n\n<li><strong>Automated test generation:<\/strong> Demonstrated that tests generated by Agent-Pex (and PromptPex) are more effective at exposing prompt non-compliance than baseline approaches.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"references\">References<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a 
class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/arxiv.org\/abs\/2503.05070\" target=\"_blank\" rel=\"noopener noreferrer\">PromptPex: Automatic Test Generation for Language Model Prompts<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/li>\n\n\n\n<li><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/microsoft\/promptpex\">microsoft\/promptpex: Test Generation for Prompts<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> &#8211; PromptPex repository<\/li>\n\n\n\n<li><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/github.com\/sierra-research\/tau2-bench\" target=\"_blank\" rel=\"noopener noreferrer\">Tau\u00b2 Benchmark<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/li>\n\n\n\n<li><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/blog.sigplan.org\/2024\/10\/22\/prompts-are-programs\/\" target=\"_blank\" rel=\"noopener noreferrer\">Prompts are Programs<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/li>\n<\/ul>\n\n\n\n<div style=\"height:30px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n","protected":false},"excerpt":{"rendered":"<p>Automated evaluation and testing of AI agents AI agents are rapidly transforming software, with projections of over a billion agents in operation by 2028. These agents, embedded in products like VS Code and M365 Copilot, perform increasingly complex tasks\u2014writing code, conducting research, and automating workflows. 
However, as agentic systems grow in complexity, understanding, debugging, and [&hellip;]<\/p>\n","protected":false},"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","footnotes":""},"research-area":[13556,13560],"msr-locale":[268875],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-1159844","msr-project","type-msr-project","status-publish","hentry","msr-research-area-artificial-intelligence","msr-research-area-programming-languages-software-engineering","msr-locale-en_us","msr-archive-status-active"],"msr_project_start":"2025-06-24","related-publications":[1167008],"related-downloads":[],"related-videos":[],"related-groups":[],"related-events":[],"related-opportunities":[],"related-posts":[],"related-articles":[],"tab-content":[],"slides":[],"related-researchers":[{"type":"user_nicename","display_name":"Shraddha Barke","user_id":43605,"people_section":"Related people","alias":"sbarke"},{"type":"user_nicename","display_name":"Peli de Halleux","user_id":32253,"people_section":"Related people","alias":"jhalleux"},{"type":"guest","display_name":"Reshabh Sharma","user_id":1159853,"people_section":"Related people","alias":""},{"type":"user_nicename","display_name":"Ben Zorn","user_id":35154,"people_section":"Related 
people","alias":"zorn"}],"msr_research_lab":[],"msr_impact_theme":[],"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/1159844","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-project"}],"version-history":[{"count":10,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/1159844\/revisions"}],"predecessor-version":[{"id":1160231,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/1159844\/revisions\/1160231"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=1159844"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=1159844"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=1159844"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=1159844"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=1159844"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}