Hi, I’m Gagan Bansal and I’m a researcher at Microsoft. Today I want to talk to you about our recent work on applying societies of agents in markets. And although I’m the one presenting, this work was a collaboration with many amazing colleagues across Microsoft.
Capabilities of AI agents are improving rapidly. We’re quickly moving towards a future where each one of us will have personal agents. In a world where everyone has agents, we believe societies of agents will drive new applications where our agents will have to interact with other agents. But how can we trust agents that we don’t control, agents that might know things ours don’t, or agents that have competing goals?
At Microsoft, we’ve been building on our expertise in multi-agent frameworks like AutoGen and Magentic-One to create useful societies of agents, ones that add value, save time, and don’t cause harm. To enable this future, we need to understand how agents behave when they interact at scale. Recent examples from the open source community, where agents could talk freely on forums, only underscore how timely and important this question is.
Let me show you what we built and what we found. Imagine a marketplace where all the buying and selling is done by agents representing people. We call these settings two-sided agentic markets. This setting is a great testbed for societies of agents, because every agent has access to different information and has competing incentives, which lets us systematically study two-sided markets.
We at Microsoft Research built a new simulation environment called Magentic Marketplace. Here, assistant agents represent customers, service agents represent businesses, and a marketplace sits in the middle handling search, communication, and transactions between agents. Here’s a typical interaction.
Suppose a customer wants to find a restaurant with something specific, like delicious empanadas and outdoor seating. Their assistant can search the marketplace, talk to the service agents, ask about menus, check amenities, and finally make a reservation. This framework allows us to test hundreds of agents buying and selling in parallel.
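The search → chat → transact loop above can be sketched in a few lines. This is a minimal illustrative sketch, not the actual Magentic Marketplace API: the `ServiceAgent` class, `answer` method, and `assist` function are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the search -> chat -> transact loop; class and
# method names are illustrative, not the real Magentic Marketplace API.
@dataclass
class ServiceAgent:
    name: str
    menu: set = field(default_factory=set)
    amenities: set = field(default_factory=set)

    def answer(self, item: str, amenity: str) -> bool:
        # The service agent answers the assistant's questions about
        # its menu and amenities.
        return item in self.menu and amenity in self.amenities

def assist(query_item: str, query_amenity: str, marketplace: list) -> str:
    # 1. Search the marketplace for candidate businesses.
    candidates = marketplace
    # 2. Talk to each service agent to gather missing information.
    for business in candidates:
        if business.answer(query_item, query_amenity):
            # 3. Transact: make a reservation with a matching business.
            return f"Reserved at {business.name}"
    return "No match found"

restaurants = [
    ServiceAgent("Casa Azul", {"tacos"}, {"patio"}),
    ServiceAgent("La Plaza", {"empanadas"}, {"outdoor seating"}),
]
print(assist("empanadas", "outdoor seating", restaurants))  # Reserved at La Plaza
```

In the real environment each of these steps is a natural-language exchange between LLM-backed agents rather than a direct method call, which is exactly what makes the behaviors studied here interesting.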
We used it to systematically ask many research questions. Do these agents even add value for customers and businesses? Does the quality of search results impact their behavior? Are they vulnerable to any biases or manipulation? We built this framework as a general research tool. It can be used to ask many other questions, even for domains beyond markets.
We started by asking whether agents actually add value for consumers. To find out, we implemented agents using frontier and open source models and computed the welfare that they achieve. Here, welfare is the value customers get from their purchase minus the price they paid; higher is better. We observed that when agents have access to high quality search results, frontier models like GPT-5 and Sonnet 4 reached near-optimal welfare.
They talked to business agents, gathered missing information, and made good choices. But we found that agent performance was tied to the quality of search results: when search quality dropped, performance dropped. We also observed that there was still a large gap between the welfare achieved by frontier models and open source models.
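The welfare metric just described is simple to state in code. A minimal sketch, with illustrative offer values (the numbers are made up for this example, not results from the paper):

```python
# Welfare as defined above: the value the customer derives from a
# purchase minus the price they paid. Higher is better.
def welfare(customer_value: float, price_paid: float) -> float:
    return customer_value - price_paid

# An optimal agent would pick the offer maximizing value minus price.
# (value to this customer, asking price) -- illustrative numbers only.
offers = [
    (50.0, 30.0),
    (70.0, 45.0),
    (40.0, 10.0),
]
best = max(welfare(v, p) for v, p in offers)
print(best)  # 30.0
```

Comparing the welfare an agent actually achieves against this maximum is what lets us say a model reached "near-optimal" welfare.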
In addition to the impact of search-result quality, we also wanted to test whether the number of search results impacts welfare. So we gave agents more search results, varying them from 3 to 100, and expected welfare to increase. But the opposite happened, resulting in a surprising paradox of choice.
Welfare dropped for almost every model. This happened because agents didn’t explore enough and contacted only a few businesses. We also conducted experiments that tested whether the order of offers from service agents matters. It did, dramatically: almost 80 to 100% of the agents accepted the first proposal they received.
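The first-proposal bias above is easy to quantify from interaction logs. A hypothetical sketch, assuming we log which proposal (by arrival order, 0 = first) each agent ultimately accepted; the function name and the sample data are illustrative, not from the actual experiments:

```python
# Fraction of agents that accepted the very first proposal they received.
# accepted_indices holds, per agent, the arrival-order index (0 = first)
# of the proposal that agent accepted.
def first_proposal_rate(accepted_indices: list) -> float:
    return sum(1 for i in accepted_indices if i == 0) / len(accepted_indices)

# Hypothetical run: 8 of 10 agents accepted the first offer to arrive.
print(first_proposal_rate([0, 0, 0, 1, 0, 0, 0, 0, 2, 0]))  # 0.8
```

A rate near 1.0, as we observed, means arrival order, not offer quality, is deciding the outcome.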
They never even looked at the alternatives. Think about what this means for a real market: speed beats quality. A business gains more from responding fast than from offering a better deal. That’s not a healthy dynamic. We also tested vulnerability to fake reviews, fake awards, and prompt injection.
Some frontier models resisted everything, but others were completely compromised, with all payments redirected to the attackers. These are early findings, and markets are just the beginning. Societies of agents will emerge anywhere agents represent people with different interests, such as supply chains, hiring, and negotiation.
Magentic Marketplace is open source on GitHub for the community to run experiments, stress-test agents, and help answer the harder questions. What guardrails do we need? How should markets be designed when both sides are AI? What we’ve shown is that simulation matters. Agents can add value, but they also inherit biases, fall for manipulation, and make choices that reward speed over quality.
These are not edge cases. These are behaviors that only emerge when societies of agents are tested at scale. If these agents are going to make high-stakes decisions on our behalf, such as transacting with other agents, we should understand their behaviors and biases before deployment, not after.
Please check out our papers and GitHub repository for more information. And thank you for attending the Microsoft Research Forum.