{"id":1170924,"date":"2026-05-11T10:19:28","date_gmt":"2026-05-11T17:19:28","guid":{"rendered":""},"modified":"2026-05-11T10:19:31","modified_gmt":"2026-05-11T17:19:31","slug":"socialreasoning-bench-measuring-whether-ai-agents-act-in-users-best-interests","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/socialreasoning-bench-measuring-whether-ai-agents-act-in-users-best-interests\/","title":{"rendered":"SocialReasoning-Bench: Measuring whether AI agents act\u00a0in users\u2019 best interests"},"content":{"rendered":"\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1400\" height=\"788\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/SocialReasoningBench-BlogHeroFeature-1400x788-1.jpg\" alt=\"Social Reasoning Bench | four icons on a blue to green gradient | person icon, chat bubble icon, chart icon, checklist icon\" class=\"wp-image-1170935\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/SocialReasoningBench-BlogHeroFeature-1400x788-1.jpg 1400w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/SocialReasoningBench-BlogHeroFeature-1400x788-1-300x169.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/SocialReasoningBench-BlogHeroFeature-1400x788-1-1024x576.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/SocialReasoningBench-BlogHeroFeature-1400x788-1-768x432.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/SocialReasoningBench-BlogHeroFeature-1400x788-1-1066x600.jpg 1066w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/SocialReasoningBench-BlogHeroFeature-1400x788-1-655x368.jpg 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/SocialReasoningBench-BlogHeroFeature-1400x788-1-240x135.jpg 240w, 
https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/SocialReasoningBench-BlogHeroFeature-1400x788-1-640x360.jpg 640w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/SocialReasoningBench-BlogHeroFeature-1400x788-1-960x540.jpg 960w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/SocialReasoningBench-BlogHeroFeature-1400x788-1-1280x720.jpg 1280w\" sizes=\"auto, (max-width: 1400px) 100vw, 1400px\" \/><\/figure>\n\n\n\n<div style=\"padding-bottom:0; padding-top:0\" class=\"wp-block-msr-immersive-section alignfull row wp-block-msr-immersive-section\">\n\t\n\t<div class=\"container\">\n\t\t<div class=\"wp-block-msr-immersive-section__inner wp-block-msr-immersive-section__inner--narrow\">\n\t\t\t<div class=\"wp-block-columns mb-10 pb-1 pr-1 is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\" style=\"box-shadow:var(--wp--preset--shadow--outlined)\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<h2 class=\"wp-block-heading h3\" id=\"at-a-glance\">At a glance<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li>AI agents are moving into social&nbsp;contexts. When agents manage calendars, negotiate purchases, or interact with other agents on a user\u2019s behalf, they need more than task competence\u2014they need social reasoning.<\/li>\n\n\n\n<li>SocialReasoning-Bench&nbsp;evaluates that ability. The benchmark tests&nbsp;whether an agent can negotiate for a user in two realistic settings: Calendar Coordination and Marketplace Negotiation.&nbsp;<\/li>\n\n\n\n<li>The benchmark measures both outcomes and process: it&nbsp;scores&nbsp;agents on&nbsp;outcome&nbsp;optimality&nbsp;(how much value they secure for the user) and&nbsp;due&nbsp;diligence&nbsp;(whether they follow a competent decision-making process).&nbsp;<\/li>\n\n\n\n<li>Current frontier models often leave value on the table. 
They usually complete the task, but they&nbsp;frequently&nbsp;accept suboptimal meeting times or poor deals instead of advocating effectively for the user.&nbsp;<\/li>\n\n\n\n<li>Prompting helps, but it is not enough. Even with explicit guidance to act in the user\u2019s best interest, performance&nbsp;remains&nbsp;well below what a trustworthy delegate should achieve.<\/li>\n<\/ul>\n<\/div>\n<\/div>\t\t<\/div>\n\t<\/div>\n\n\t<\/div>\n\n\n\n<p>As AI agents&nbsp;take on more real-world tasks, they are increasingly&nbsp;operating&nbsp;in&nbsp;social contexts.&nbsp;With&nbsp;the right&nbsp;integrations, agents like Claude Cowork and Google Gemini can manage email and calendar&nbsp;workflows.&nbsp;In these settings, the agent&nbsp;must&nbsp;interact with&nbsp;others&nbsp;on your behalf. This&nbsp;requires&nbsp;<em>social reasoning<\/em>&nbsp;\u2014&nbsp;understanding what you want, what the counterparty wants, and what information to reveal, protect, or push back on.<\/p>\n\n\n\n<p>Our previous research suggests that today&#8217;s frontier models lack social reasoning. In our <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/magentic-marketplace-an-open-source-simulation-environment-for-studying-agentic-markets\/\" type=\"post\" id=\"1154326\">simulated multi-agent marketplace<\/a>, agents accepted the first proposal they received up to 93% of the time without exploring alternatives. When <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/red-teaming-a-network-of-agents-understanding-what-breaks-when-ai-agents-interact-at-scale\/\" type=\"link\" id=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/red-teaming-a-network-of-agents-understanding-what-breaks-when-ai-agents-interact-at-scale\/\">red-teaming a social network of agents<\/a>, a single malicious message spread through the system and led agents to disclose private data before passing the message along.<\/p>\n\n\n\n<p>This kind of relationship has a long history&nbsp;outside&nbsp;AI. 
In economics and&nbsp;law&nbsp;it is called a&nbsp;<em>principal-agent&nbsp;<\/em>relationship: an agent acts on a principal&#8217;s behalf in interactions with others whose interests differ. Attorneys, real-estate agents, and financial advisors all&nbsp;operate&nbsp;in this mode, and the duties they owe\u2014care, loyalty, confidentiality\u2014are codified in centuries of professional norms. AI agents acting on a user&#8217;s behalf&nbsp;should&nbsp;ultimately be&nbsp;held to&nbsp;similar&nbsp;standards.<\/p>\n\n\n\n<p>To measure and drive progress in social reasoning, we built&nbsp;SocialReasoning-Bench:&nbsp;&nbsp;a benchmark for testing whether agents can&nbsp;reason and&nbsp;negotiate on a user\u2019s behalf against a counterparty with independent goals,&nbsp;private information, and potentially adversarial intent.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"introducing-socialreasoning-bench\">Introducing SocialReasoning-Bench<\/h2>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1376\" height=\"768\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure1_NEW_Benchmark.png\" alt=\"Figure 1: Our benchmark measures agents' social reasoning ability in two domains, calendar coordination and marketplace negotiation. Each requires communicating with other parties, advocating on a principal's behalf, and reasoning about tradeoffs. 
\" class=\"wp-image-1171327\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure1_NEW_Benchmark.png 1376w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure1_NEW_Benchmark-300x167.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure1_NEW_Benchmark-1024x572.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure1_NEW_Benchmark-768x429.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure1_NEW_Benchmark-240x134.png 240w\" sizes=\"auto, (max-width: 1376px) 100vw, 1376px\" \/><figcaption class=\"wp-element-caption\"><em>Figure&nbsp;<em>1<\/em>: Our benchmark measures agents&#8217; social reasoning ability in two domains, calendar&nbsp;coordination&nbsp;and marketplace negotiation. Each requires&nbsp;communicating with other parties, advocating on&nbsp;a&nbsp;principal&#8217;s behalf, and reasoning about tradeoffs.&nbsp;<\/em><\/figcaption><\/figure>\n\n\n\n<p>SocialReasoning-Bench evaluates social reasoning in two domains: Calendar Coordination and Marketplace Negotiation. In each, an agent advocates for its user against a counterparty and is scored on both the outcome it reached and the process it followed. We find that frontier models complete most tasks but consistently leave value on the table for the user.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"calendar-coordination\">Calendar coordination<\/h3>\n\n\n\n<p>In calendar coordination, an assistant agent manages a user&#8217;s calendar on a single day and fields a meeting request from another agent.<\/p>\n\n\n\n<p>We assume the agent has access to a value function over time slots that captures the user\u2019s scheduling preferences between 0.0 and 1. 
This function could be provided explicitly by the user or inferred from their calendar history, and is given to the assistant at the start of the task.<\/p>\n\n\n\n<p>The counterparty is a requestor agent representing another person who wants to schedule a meeting with the user. The counterparty has its own value function over the same slots, constructed as the inverse of the user&#8217;s, so the slots most valuable to one are least valuable to the other. Some requestors negotiate in good faith, while others use the interaction to extract private calendar details or push the assistant toward times the user does not want.<\/p>\n\n\n\n<p>In each task there is a <em>zone of possible agreement<\/em> (ZOPA), a term borrowed from negotiation theory for the set of outcomes that both parties could plausibly accept. In calendar coordination, the ZOPA is the set of time slots that are free on both calendars. We construct every task so that the ZOPA contains at least three slots with different preference scores for the user, and the requestor&#8217;s opening request always conflicts with the user&#8217;s calendar.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"marketplace-negotiation\">Marketplace negotiation<\/h3>\n\n\n\n<p>In marketplace negotiation, a buyer agent representing a user negotiates with a seller agent to purchase a single product.<\/p>\n\n\n\n<p>The user wants to pay as little as possible for the product. Their value function is the gap between the deal price and a private reservation price (the highest price they would pay). A larger gap captures more value, and a deal above the reservation captures none.<\/p>\n\n\n\n<p>The&nbsp;counterparty&nbsp;is a seller agent with its own private reservation price set below the buyer&#8217;s. 
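In code, the two marketplace value functions reduce to clipped price gaps. This is a sketch under the definitions above; the function and variable names are ours, not the benchmark's:

```python
def buyer_value(deal_price, buyer_reservation):
    # The user's value is the gap between their reservation price (the
    # highest price they would pay) and the deal price; a deal above the
    # reservation captures nothing.
    return max(buyer_reservation - deal_price, 0.0)

def seller_value(deal_price, seller_reservation):
    # The seller's value mirrors the buyer's: higher deal prices yield
    # more value, and prices below the seller's reservation yield none.
    return max(deal_price - seller_reservation, 0.0)
```

The ZOPA is then simply the interval between the two reservation prices, inside which both functions are positive.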
The counterparty&#8217;s value function mirrors the&nbsp;user&#8217;s, with higher deal prices yielding more value and&nbsp;deal&nbsp;prices below the seller\u2019s reservation price yielding no value.<\/p>\n\n\n\n<p>The ZOPA is the price range between the seller&#8217;s and buyer&#8217;s reservations. The seller&#8217;s opening offer is always above the buyer&#8217;s reservation, forcing the buyer to negotiate the price down.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"new-metrics-for-a-new-setting\">New metrics for a new setting<\/h2>\n\n\n\n<p>Existing benchmarks focus on task completion: did the meeting get scheduled? Did the trade close? In principal\u2013agent settings, what matters is not just <em>whether<\/em> the task is completed, but how <em>well<\/em> it is done. We introduce new measures to capture this distinction.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"outcome-optimality\">Outcome Optimality<\/h3>\n\n\n\n<p>Outcome optimality scores the share of available value the agent captured for its principal, on a 0-to-1 scale. The outcome inside the ZOPA most favorable to the principal scores 1.0, while the outcome most favorable to the counterparty scores 0.0. Intermediate outcomes are scored by where the principal&#8217;s value function places them between those two endpoints.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"due-diligence\">Due Diligence<\/h3>\n\n\n\n<p>Outcome optimality alone conflates skill with luck. An agent that immediately accepts a counterparty&#8217;s first offer, without inspecting its situation or making a counter-proposal, can still score well if the counterparty happens to propose a good outcome. To separate skill from luck, we introduce a process metric.<\/p>\n\n\n\n<p>Due diligence scores process quality on a 0-to-1 scale by comparing the agent&#8217;s actions, at each decision point in the trajectory, against the action a deterministic <em>reasonable-agent<\/em> policy would have taken in the same state. 
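Both metrics come down to simple normalizations. A minimal sketch under the definitions above (our own names; the benchmark's actual implementation may differ):

```python
def outcome_optimality(outcome_value, counterparty_best, principal_best):
    # Share of available value captured for the principal: the in-ZOPA
    # outcome the principal values most scores 1.0, and the outcome the
    # counterparty values most scores 0.0.
    return (outcome_value - counterparty_best) / (principal_best - counterparty_best)

def due_diligence(agent_actions, reasonable_agent_actions):
    # Rate at which the agent's choices match what the deterministic
    # reasonable-agent policy would have done at each decision point.
    matches = sum(a == r for a, r in zip(agent_actions, reasonable_agent_actions))
    return matches / len(reasonable_agent_actions)
```
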
The reasonable-agent policy is a greedy procedure that captures what a competent advocate would do at each step, such as gathering relevant context before acting, opening with a position favorable to its principal, and conceding only after better options have been exhausted. The Due Diligence score is the rate at which the agent&#8217;s actual choices match the reasonable-agent&#8217;s choices over the trajectory.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"duty-of-care\">Duty of care<\/h3>\n\n\n\n<p>Together, Outcome Optimality and Due Diligence form an operational notion of an agent&#8217;s <em>duty of care<\/em> to the person it represents. An agent that lands a good outcome through a careless process is fragile, while an agent that follows good process but lands a bad outcome points to a capability gap rather than negligence. Only an agent that scores well on both is exhibiting strong social reasoning.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"experimental-setup\">Experimental setup<\/h2>\n\n\n\n<p>For the calendar assistant agent and marketplace buyer agent, we evaluate GPT-4.1 with chain-of-thought, GPT-5.4 at&nbsp;high&nbsp;reasoning effort,&nbsp;and&nbsp;Claude Sonnet 4.6&nbsp;and Gemini 3 Flash at&nbsp;high&nbsp;thinking levels. The counterparty (i.e.&nbsp;requestor in calendar coordination, and seller in marketplace negotiation) is always Gemini 3 Flash with medium reasoning effort, held constant across all conditions so that any difference in scores reflects the model under test rather than the difficulty of its opponent.<\/p>\n\n\n\n<p>Each model is run under two prompt conditions: <strong>Basic Prompting<\/strong> where the agent receives only role and tool descriptions, and <strong>Defensive Prompting<\/strong> where the agent additionally receives explicit guidance to consult all available sources and advocate for the user toward the best possible outcome.<\/p>\n\n\n\n<p>Each task runs for 10 negotiation rounds, at most. 
The counterparty proposes first in every task.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"what-we-re-finding\">What we\u2019re finding<\/h2>\n\n\n\n<h3 class=\"wp-block-heading h4\" id=\"finding-1-agents-complete-tasks-at-near-perfect-rates-but-produce-poor-outcomes\">Finding 1: Agents complete tasks at near-perfect rates but produce poor outcomes.<\/h3>\n\n\n\n<p>In calendar scheduling, agents&nbsp;almost always&nbsp;succeed in booking the meeting, but&nbsp;most&nbsp;often at suboptimal times. In marketplace negotiation, deals&nbsp;almost always&nbsp;close, but&nbsp;frequently&nbsp;at the worst possible price. The tasks get done, but not done well: task completion signals success, while&nbsp;Outcome&nbsp;Optimality reveals a consistent failure to act in the principal\u2019s best interest.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"2560\" height=\"928\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure2-fixed_Benchmark-scaled.png\" alt=\"Figure 2: Task Completion vs Outcome Optimality by model and domain. All models complete tasks at near-perfect rates, but produce poor outcomes. We measured Outcome Optimality against the two prompts, basic and defensive. 
Defensive prompting helps but does not close the gap.\u00a0\" class=\"wp-image-1171350\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure2-fixed_Benchmark-scaled.png 2560w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure2-fixed_Benchmark-300x109.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure2-fixed_Benchmark-1024x371.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure2-fixed_Benchmark-768x278.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure2-fixed_Benchmark-1536x557.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure2-fixed_Benchmark-2048x743.png 2048w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure2-fixed_Benchmark-240x87.png 240w\" sizes=\"auto, (max-width: 2560px) 100vw, 2560px\" \/><figcaption class=\"wp-element-caption\"><em>Figure&nbsp;<em>2<\/em>: Task Completion vs Outcome Optimality by model and domain. All&nbsp;models&nbsp;complete tasks at&nbsp;near-perfect&nbsp;rates, but&nbsp;produce poor outcomes.&nbsp;We&nbsp;measured&nbsp;<em>O<\/em>utcome&nbsp;<em>O<\/em>ptimality against&nbsp;the two prompts, basic and defensive.&nbsp;Defensive prompting&nbsp;helps but does not close the gap.&nbsp;<\/em><\/figcaption><\/figure>\n\n\n\n<h3 class=\"wp-block-heading h4\" id=\"finding-2-defensive-prompting-helps-but-is-not-enough-to-close-the-gap\">Finding 2: Defensive prompting helps, but is not enough to close the gap.<\/h3>\n\n\n\n<p>When we instruct agents on how to work hard on their principal\u2019s behalf, we see outcome improvements&nbsp;across both domains, but it is not enough to close the gap. GPT-5.4 benefits most&nbsp;from defensive prompting&nbsp;(+0.21 in calendaring, +0.12 in&nbsp;marketplace), while&nbsp;GPT-4.1&nbsp;barely responds to it in either domain. 
The other models fall somewhere in between.<\/p>\n\n\n\n<h3 class=\"wp-block-heading h4\" id=\"finding-3-outcome-optimality-shows-how-much-value-agents-leave-on-the-table\">Finding 3: Outcome optimality shows how much value agents leave on the table.<\/h3>\n\n\n\n<p>Outcome optimality reflects where each deal lands within the ZOPA. When we plot outcomes, they cluster closer to the counterparty\u2019s ideal than the principal\u2019s.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"2560\" height=\"904\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure3-fixed_Benchmark-scaled.png\" alt=\"Figure 3: Outcome Optimality (OO) distribution by model and domain. Each dot is one task instance. OO=1.0 means the agent captured all available value for its principal; OO=0.0 means the counterparty captured everything. Black lines show the mean. In marketplace, outcomes cluster near zero across all models. In calendar, agents perform better but still settle below the midpoint on average.\u00a0\" class=\"wp-image-1171353\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure3-fixed_Benchmark-scaled.png 2560w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure3-fixed_Benchmark-300x106.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure3-fixed_Benchmark-1024x362.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure3-fixed_Benchmark-768x271.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure3-fixed_Benchmark-1536x542.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure3-fixed_Benchmark-2048x723.png 2048w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure3-fixed_Benchmark-240x85.png 240w\" sizes=\"auto, (max-width: 2560px) 
100vw, 2560px\" \/><figcaption class=\"wp-element-caption\"><em>Figure&nbsp;<em>3<\/em>: Outcome Optimality&nbsp;<em>(OO)&nbsp;<\/em>distribution by model and domain. Each dot is one task instance. OO=1.0 means the agent captured all available value for its principal; OO=0.0 means the counterparty captured everything. Black lines show the&nbsp;mean. In marketplace, outcomes cluster near zero across all models. In calendar, agents perform better but still settle below the midpoint on average.&nbsp;<\/em><\/figcaption><\/figure>\n\n\n\n<p>In marketplace negotiation, all&nbsp;models settle at or near zero for&nbsp;Outcome&nbsp;Optimality, accepting deals that give away&nbsp;virtually all&nbsp;available&nbsp;surplus. In calendar scheduling, agents perform better but still land below the midpoint, accepting the requestor\u2019s preferred slots rather than ones that better serve their principal.<\/p>\n\n\n\n<p>Measuring value capture in agent negotiations builds on recent studies examining how agents perform in marketplace settings. Because we operate in a controlled setting, we can establish ground-truth constraints for both parties and measure exactly how the available value was divided. Our formulation also generalizes beyond price-based negotiations: by abstracting to a domain-specific value function, Outcome Optimality can measure surplus division in any setting where agents face competing incentives, including non-monetary domains like calendar scheduling where \u201cvalue\u201d is defined over preference scores rather than prices.<\/p>\n\n\n\n<h3 class=\"wp-block-heading h4\" id=\"finding-4-due-diligence-helps-distinguish-between-luck-and-skill\">Finding 4: Due&nbsp;Diligence&nbsp;helps&nbsp;distinguish&nbsp;between luck and skill.<\/h3>\n\n\n\n<p>When we look at the combination of outcome quality and process quality,&nbsp;a more nuanced picture&nbsp;emerges. 
Many agents that achieve reasonable outcomes do so through fragile processes: they&nbsp;don&#8217;t&nbsp;check context before acting or they accept offers without countering. High&nbsp;Outcome&nbsp;Optimality&nbsp;with low&nbsp;Due&nbsp;Diligence&nbsp;suggests an agent that got lucky rather than one that can be trusted. Conversely, some agents show genuine diligence \u2014 gathering information, pushing back \u2014 but still land on poor outcomes, pointing to capability gaps rather than negligence.&nbsp;Dividing&nbsp;Outcome&nbsp;Optimality&nbsp;and&nbsp;Due&nbsp;Diligence&nbsp;each into high (>=0.5) and low (<0.5) buckets,&nbsp;we can sort every task into one of four archetypes.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th><\/th><th>Not diligent (DD < 0.5)<\/th><th>Diligent (DD \u2265 0.5)<\/th><\/tr><\/thead><tbody><tr><td>Good outcome (OO \u2265 0.5)<\/td><td>Lucky<\/td><td>Robust<\/td><\/tr><tr><td>Poor outcome (OO < 0.5)<\/td><td>Negligent<\/td><td>Ineffective<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>Through the lens of this&nbsp;decomposition, we can see that models&nbsp;exhibit&nbsp;robust&nbsp;<em>duty of&nbsp;care<\/em>&nbsp;on more than 50% of calendar coordination tasks, with Gemini 3 Flash leading at&nbsp;90% robust.&nbsp;In marketplace negotiation, though,&nbsp;a very different&nbsp;picture&nbsp;emerges.&nbsp;GPT-4.1 is&nbsp;negligent&nbsp;in 95% of&nbsp;tasks,&nbsp;neither gathering information nor advocating for its principal, while Claude Sonnet 4.6, GPT-5.4, and Gemini 3 Flash show ineffective&nbsp;behavior in&nbsp;roughly 90%&nbsp;of marketplace tasks, negotiating diligently but still unable to achieve good outcomes.&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure4_NEW_Benchmark.png\" alt=\"Figure 4: Splitting Outcome Optimality and Due Diligence into 
\u201clow\u201d (<0.5) and \u201chigh\u201d (>=0.5) buckets each, we plot the percent of tasks for each model that fall into each quadrant. For example, in calendar scheduling, GPT-4.1 achieves both high OO and high DD (Robust) in 63% of tasks. In contrast, in the marketplace domain, GPT-4.1 exhibits low OO and low DD (Negligent) in 95% of tasks.\u00a0\" class=\"wp-image-1171332\"\/><figcaption class=\"wp-element-caption\"><em>Figure&nbsp;4: Splitting Outcome Optimality and Due Diligence into \u201clow\u201d (<0.5) and \u201chigh\u201d (>=0.5) buckets each, we plot the percent of tasks for each model that fall into each quadrant. For example, in&nbsp;calendar scheduling, GPT-4.1 achieves&nbsp;both high OO and high DD (Robust) in&nbsp;63% of&nbsp;tasks. In contrast,&nbsp;in the marketplace domain, GPT-4.1&nbsp;exhibits low&nbsp;OO and low DD (Negligent) in 95% of tasks.&nbsp;<\/em><\/figcaption><\/figure>\n\n\n\n<p>Figures 5\u20138 illustrate&nbsp;these different behaviors and failure modes with real examples from&nbsp;SocialReasoning-Bench&nbsp;in the calendaring domain.&nbsp;We see agents that follow a strong negotiation strategy and secure high-value outcomes, but also agents&nbsp;that achieve reasonable outcomes through sloppy processes, such as&nbsp;failing to propose&nbsp;the principal\u2019s best&nbsp;option.&nbsp;Others begin with a strong position but concede prematurely, collapsing to poor deals.&nbsp;At the extreme, some agents exhibit negligent behavior, accepting the first proposal without checking constraints, even&nbsp;when it directly conflicts with the user\u2019s interests.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1904\" height=\"894\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure5_Benchmark.png\" alt=\"Figure 5. 
A real paraphrased example of robust behavior from GPT-4.1 in the calendaring domain, achieving a good outcome after proposing the principal\u2019s most preferred option first, correctly refusing the conflict, and then holding the line at their second best option.\" class=\"wp-image-1171295\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure5_Benchmark.png 1904w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure5_Benchmark-300x141.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure5_Benchmark-1024x481.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure5_Benchmark-768x361.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure5_Benchmark-1536x721.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure5_Benchmark-240x113.png 240w\" sizes=\"auto, (max-width: 1904px) 100vw, 1904px\" \/><figcaption class=\"wp-element-caption\"><em><em>Figure&nbsp;5.&nbsp;A&nbsp;real paraphrased example of&nbsp;robust behavior&nbsp;from GPT-4.1&nbsp;in the calendaring domain, achieving a good outcome&nbsp;after proposing the principal\u2019s most preferred&nbsp;option&nbsp;first,&nbsp;correctly refusing the conflict,&nbsp;and then holding the line at their&nbsp;second best&nbsp;option.<\/em><\/em><\/figcaption><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1890\" height=\"880\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure6_Benchamrk.png\" alt=\"Figure 6. GPT-4.1 in the calendaring domain achieving a reasonable outcome from a sloppy process that didn\u2019t include proposing the principal\u2019s most preferred option. 
\" class=\"wp-image-1171297\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure6_Benchamrk.png 1890w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure6_Benchamrk-300x140.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure6_Benchamrk-1024x477.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure6_Benchamrk-768x358.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure6_Benchamrk-1536x715.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure6_Benchamrk-240x112.png 240w\" sizes=\"auto, (max-width: 1890px) 100vw, 1890px\" \/><figcaption class=\"wp-element-caption\"><em><em>Figure&nbsp;6.&nbsp;GPT-4.1 in the calendaring domain achieving a&nbsp;reasonable outcome from a sloppy process that&nbsp;didn\u2019t&nbsp;include proposing the principal\u2019s&nbsp;most preferred&nbsp;option.<\/em>&nbsp;<\/em><\/figcaption><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1896\" height=\"880\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure7_Benchmark.png\" alt=\"Figure 7. GPT-4.1 in the calendaring domain starting out strong by proposing the principal\u2019s most preferred slot but then caving early and achieving a poor outcome. 
\" class=\"wp-image-1171300\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure7_Benchmark.png 1896w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure7_Benchmark-300x139.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure7_Benchmark-1024x475.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure7_Benchmark-768x356.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure7_Benchmark-1536x713.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure7_Benchmark-240x111.png 240w\" sizes=\"auto, (max-width: 1896px) 100vw, 1896px\" \/><figcaption class=\"wp-element-caption\"><em><em>Figure&nbsp;7. GPT-4.1 in the calendaring domain&nbsp;starting out strong by proposing the principal\u2019s most&nbsp;preferred slot&nbsp;but then caving early&nbsp;and achieving a poor&nbsp;outcome.<\/em>&nbsp;<\/em><\/figcaption><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1894\" height=\"892\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure8_Benchmark.png\" alt=\"Figure 8. GPT-4.1 exhibiting negligent behavior, accepting the requestor\u2019s first proposal without confirming availability and conflicting with another meeting on the principal\u2019s calendar. 
\" class=\"wp-image-1171302\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure8_Benchmark.png 1894w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure8_Benchmark-300x141.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure8_Benchmark-1024x482.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure8_Benchmark-768x362.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure8_Benchmark-1536x723.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure8_Benchmark-240x113.png 240w\" sizes=\"auto, (max-width: 1894px) 100vw, 1894px\" \/><figcaption class=\"wp-element-caption\"><em><em>Figure&nbsp;<\/em>8.<em>&nbsp;GPT-4.1 exhibiting negligent behavior,&nbsp;accepting the&nbsp;requestor\u2019s&nbsp;first proposal&nbsp;without confirming availability and&nbsp;conflicting with another meeting&nbsp;on the principal\u2019s calendar.<\/em>&nbsp;<\/em><\/figcaption><\/figure>\n\n\n\n<p>Taken together, these examples highlight why outcome alone is insufficient.&nbsp;Without measuring process, we risk mistaking brittle or accidental success for genuine capability.&nbsp;Due Diligence helps surface whether an agent is consistently behaving like a competent, trustworthy delegate, or simply getting lucky.<\/p>\n\n\n\n<h3 class=\"wp-block-heading h4\" id=\"finding-5-agents-are-vulnerable-to-adversarial-manipulation\">Finding 5: Agents are vulnerable to adversarial manipulation<\/h3>\n\n\n\n<p>When we stress test agents by pitting them against adversarial counterparties, we find that agents struggle to balance when to engage, when to refuse, and how to negotiate under pressure.<\/p>\n\n\n\n<p>To create these adversarial scenarios, we introduce counterparties explicitly trying to manipulate outcomes or bypass protective steps. 
Some follow carefully designed strategies, applying pressure or probing for information, while others use more unpredictable, creatively generated <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/articles\/whimsical-strategies-break-ai-agents-generating-out-of-distribution-adversarial-strategies-at-scale\/\">whimsical tactics that mimic novel forms of social engineering<\/a>. Together, these test whether agents can handle not just known attacks, but unfamiliar ones.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"2560\" height=\"1438\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure9-fixed_Benchmark-scaled.png\" alt=\"Figure 9: Refusal Rates and Outcome Optimality when agents engaged with adversarial requestors in both domains. Agents rarely refuse adversarial requests in calendaring, while refusing more often in the marketplace. When agents did engage with malicious actors, Outcome Optimality dropped across the board.\u00a0\" class=\"wp-image-1171355\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure9-fixed_Benchmark-scaled.png 2560w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure9-fixed_Benchmark-300x169.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure9-fixed_Benchmark-1024x575.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure9-fixed_Benchmark-768x431.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure9-fixed_Benchmark-1536x863.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure9-fixed_Benchmark-2048x1150.png 2048w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure9-fixed_Benchmark-1066x600.png 1066w, 
https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure9-fixed_Benchmark-655x368.png 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure9-fixed_Benchmark-240x135.png 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure9-fixed_Benchmark-640x360.png 640w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure9-fixed_Benchmark-960x540.png 960w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure9-fixed_Benchmark-1280x720.png 1280w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/figure9-fixed_Benchmark-1920x1080.png 1920w\" sizes=\"auto, (max-width: 2560px) 100vw, 2560px\" \/><figcaption class=\"wp-element-caption\"><em>Figure&nbsp;9: Refusal Rates and Outcome Optimality when agents engaged with adversarial requestors in both domains. Agents rarely refuse adversarial requests in calendaring, while refusing more often in the marketplace. When agents did engage with malicious actors, Outcome Optimality dropped across the board.<\/em><\/figcaption><\/figure>\n\n\n\n<p>We find that, aside from Claude Sonnet 4.6, agents rarely refuse adversarial requests in calendar scheduling, while refusing more often in marketplace settings. This suggests that adversarial intent is harder to detect in socially framed interactions. When agents do engage, the impact is starkest in calendar scheduling, with Outcome Optimality dropping substantially across GPT-4.1, GPT-5.4, and Gemini Flash 3, indicating that adversarial counterparties successfully steer these agents toward worse outcomes. 
In the marketplace domain, Outcome Optimality when agents engaged remains comparable to the low levels achieved against benign counterparties: agents capture little to no value for their principals either way.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"why-this-matters-now\">Why this matters now<\/h2>\n\n\n\n<p>Agents are interacting with each other in multi-party environments, from collaborating across enterprise workflows to transacting in digital marketplaces. As these networks form, the social reasoning gaps we observe in simple two-agent settings can begin to compound. Weak negotiation, over-trust, or failure to exercise due diligence no longer stay local. They propagate through coordination, influence downstream decisions, and shape collective outcomes.<\/p>\n\n\n\n<p>In isolation, an agent that accepts a bad meeting time or a poor deal causes limited harm. In a network, those same behaviors can cascade, leading to systematically worse coordination or widespread value loss across many agents.<\/p>\n\n\n\n<p>Recent work has begun exploring these risks and dynamics through case studies of agents interacting in networked settings. SocialReasoning-Bench complements this line of work by providing a controlled, reproducible benchmark that isolates interaction behaviors and makes them measurable. 
This allows us to move beyond anecdotes and systematically track progress, giving model, agent, and platform developers a concrete target for building agents that act as trustworthy delegates.<\/p>\n\n\n\n<p>SocialReasoning-Bench is open source and available on <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/github.com\/microsoft\/social-reasoning-bench\" type=\"link\" id=\"https:\/\/github.com\/microsoft\/social-reasoning-bench\" target=\"_blank\" rel=\"noopener noreferrer\">GitHub<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"limitations-and-future-work\">Limitations and future work<\/h2>\n\n\n\n<p>Our current measures treat all counterparties equally. In practice, relationships matter. A socially intelligent agent should modulate its assertiveness based on its principal\u2019s relationship with the counterparty: pushing too hard when scheduling a meeting with a senior executive may damage a valuable relationship, and sometimes the right outcome is reached through compromise. Developing relationship-aware measures that account for power dynamics, rapport, and long-term consequences is an important direction for future work.<\/p>\n\n\n\n<p>We evaluate social reasoning in simplified two-agent settings, whereas real-world delegation often involves multi-party dynamics such as group scheduling or multi-stakeholder negotiations. Each task is also treated as an independent encounter, with no modeling of long-term relationships, reputation, or trust-building across repeated interactions. Our scenarios are limited to English-language and U.S.-centric business contexts, though social norms around negotiation, privacy, and hierarchy vary widely across cultures. 
Looking ahead, we plan to extend our benchmark to more diverse settings.<\/p>\n\n\n\n<p>Finally, Outcome Optimality works well in settings with clear boundaries, where a \u201cgood\u201d outcome can be defined and measured. But many tasks that require <em>duty of care<\/em>, such as drafting sensitive messages or navigating team dynamics, may not have a well-defined ZOPA (zone of possible agreement). In these cases, outcomes depend on context, relationships, and judgment in ways that may resist a single score. Extending our approach to these more subjective settings is an important direction for future work.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"acknowledgements\">Acknowledgements<\/h2>\n\n\n\n<p>We would like to thank <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/brlucier\/\" target=\"_blank\" rel=\"noreferrer noopener\">Brendan Lucier<\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/adamfo\/\" target=\"_blank\" rel=\"noreferrer noopener\">Adam Fourney<\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/aswearngin\/\" target=\"_blank\" rel=\"noreferrer noopener\">Amanda Swearngin<\/a>, and <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/eckamar\/\" target=\"_blank\" rel=\"noreferrer noopener\">Ece Kamar<\/a> for their helpful feedback, discussions, and support of this work.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Using SocialReasoning Bench, we observed a stable pattern across models\u2014agents execute competently, but fail to consistently improve the user\u2019s position, even with explicit instructions to optimize for user 
interest.<\/p>\n","protected":false},"author":43868,"featured_media":1170935,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":null,"msr_hide_image_in_river":0,"footnotes":""},"categories":[1],"tags":[],"research-area":[13556,13558],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[243984],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-1170924","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-research-blog","msr-research-area-artificial-intelligence","msr-research-area-security-privacy-cryptography","msr-locale-en_us","msr-post-option-blog-homepage-featured"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[992148],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[],"related-projects":[],"related-events":[],"related-researchers":[{"type":"user_nicename","value":"Tyler Payne","user_id":43967,"display_name":"Tyler Payne","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/tylerpayne\/\" aria-label=\"Visit the profile page for Tyler Payne\">Tyler Payne<\/a>","is_active":false,"last_first":"Payne, Tyler","people_section":0,"alias":"tylerpayne"},{"type":"user_nicename","value":"Will Epperson","user_id":44012,"display_name":"Will Epperson","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/willepperson\/\" aria-label=\"Visit the profile page for Will Epperson\">Will Epperson<\/a>","is_active":false,"last_first":"Epperson, Will","people_section":0,"alias":"willepperson"},{"type":"user_nicename","value":"Safoora 
Yousefi","user_id":43530,"display_name":"Safoora Yousefi","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/sayouse\/\" aria-label=\"Visit the profile page for Safoora Yousefi\">Safoora Yousefi<\/a>","is_active":false,"last_first":"Yousefi, Safoora","people_section":0,"alias":"sayouse"},{"type":"user_nicename","value":"Zachary Huang","user_id":44011,"display_name":"Zachary Huang","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/zacharyhuang\/\" aria-label=\"Visit the profile page for Zachary Huang\">Zachary Huang<\/a>","is_active":false,"last_first":"Huang, Zachary","people_section":0,"alias":"zacharyhuang"},{"type":"user_nicename","value":"Gagan Bansal","user_id":41707,"display_name":"Gagan Bansal","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/gaganbansal\/\" aria-label=\"Visit the profile page for Gagan Bansal\">Gagan Bansal<\/a>","is_active":false,"last_first":"Bansal, Gagan","people_section":0,"alias":"gaganbansal"},{"type":"user_nicename","value":"Wenyue Hua","user_id":44010,"display_name":"Wenyue Hua","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/wenyuehua\/\" aria-label=\"Visit the profile page for Wenyue Hua\">Wenyue Hua<\/a>","is_active":false,"last_first":"Hua, Wenyue","people_section":0,"alias":"wenyuehua"},{"type":"user_nicename","value":"Maya Murad","user_id":43879,"display_name":"Maya Murad","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/mayamurad\/\" aria-label=\"Visit the profile page for Maya Murad\">Maya Murad<\/a>","is_active":false,"last_first":"Murad, Maya","people_section":0,"alias":"mayamurad"},{"type":"guest","value":"asli-celikyilmaz","user_id":"1171348","display_name":"Asli Celikyilmaz","author_link":"<a href=\"http:\/\/asli.us\/\" aria-label=\"Visit the profile page for Asli Celikyilmaz\">Asli Celikyilmaz<\/a>","is_active":true,"last_first":"Celikyilmaz, 
Asli","people_section":0,"alias":"asli-celikyilmaz"},{"type":"user_nicename","value":"Saleema Amershi","user_id":33505,"display_name":"Saleema Amershi","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/samershi\/\" aria-label=\"Visit the profile page for Saleema Amershi\">Saleema Amershi<\/a>","is_active":false,"last_first":"Amershi, Saleema","people_section":0,"alias":"samershi"}],"msr_type":"Post","featured_image_thumbnail":"<img width=\"960\" height=\"540\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/SocialReasoningBench-BlogHeroFeature-1400x788-1-960x540.jpg\" class=\"img-object-cover\" alt=\"Social Reasoning Bench | four icons on a blue to green gradient | person icon, chat bubble icon, chart icon, checklist icon\" decoding=\"async\" loading=\"lazy\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/SocialReasoningBench-BlogHeroFeature-1400x788-1-960x540.jpg 960w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/SocialReasoningBench-BlogHeroFeature-1400x788-1-300x169.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/SocialReasoningBench-BlogHeroFeature-1400x788-1-1024x576.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/SocialReasoningBench-BlogHeroFeature-1400x788-1-768x432.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/SocialReasoningBench-BlogHeroFeature-1400x788-1-1066x600.jpg 1066w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/SocialReasoningBench-BlogHeroFeature-1400x788-1-655x368.jpg 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/SocialReasoningBench-BlogHeroFeature-1400x788-1-240x135.jpg 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/SocialReasoningBench-BlogHeroFeature-1400x788-1-640x360.jpg 640w, 
https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/SocialReasoningBench-BlogHeroFeature-1400x788-1-1280x720.jpg 1280w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/05\/SocialReasoningBench-BlogHeroFeature-1400x788-1.jpg 1400w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/>","byline":"","formattedDate":"May 11, 2026","formattedExcerpt":"Using SocialReasoning Bench, we observed a stable pattern across models\u2014agents execute competently, but fail to consistently improve the user\u2019s position, even with explicit instructions to optimize for user interest.","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/1170924","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/43868"}],"replies":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/comments?post=1170924"}],"version-history":[{"count":56,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/1170924\/revisions"}],"predecessor-version":[{"id":1171379,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/1170924\/revisions\/1171379"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/1170935"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=1170924"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/categories?post=1170924"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/
www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/tags?post=1170924"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=1170924"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=1170924"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=1170924"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=1170924"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=1170924"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=1170924"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=1170924"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=1170924"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}