Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks

MSR-TR-2024-47

Published by Microsoft

Modern AI agents, driven by advances in large foundation models, promise to enhance our productivity and transform our lives by augmenting our knowledge and capabilities. To achieve this vision, AI agents must effectively plan, perform multi-step reasoning and actions, respond to novel observations, and recover from errors, to successfully complete complex tasks across a wide range of scenarios. In this work, we introduce Magentic-One, a high-performing open-source agentic system for solving such tasks. Magentic-One uses a multi-agent architecture where a lead agent, the Orchestrator, plans, tracks progress, and re-plans to recover from errors. Throughout task execution, the Orchestrator also directs other specialized agents to perform tasks as needed, such as operating a web browser, navigating local files, or writing and executing Python code. Our experiments show that Magentic-One achieves statistically competitive performance to the state-of-the-art on three diverse and challenging agentic benchmarks: GAIA, AssistantBench, and WebArena. Notably, Magentic-One achieves these results without modification to core agent capabilities or to how they collaborate, demonstrating progress towards the vision of generalist agentic systems. Moreover, Magentic-One’s modular design allows agents to be added or removed from the team without additional prompt tuning or training, easing development and making it extensible to future scenarios. We provide an open-source implementation of Magentic-One and AutoGenBench, a standalone agentic evaluation tool. AutoGenBench provides built-in controls for repetition and isolation to run agentic benchmarks where actions may produce side-effects, in a rigorous and contained way. Magentic-One, AutoGenBench, and detailed empirical performance evaluations of Magentic-One, including ablations and error analysis, are available at https://aka.ms/magentic-one.

Related tools

Magentic-One

November 12, 2024

Magentic-One is a generalist multi-agent system built to solve complex web- and file-based tasks. Using an Orchestrator agent alongside specialized agents, it automates complex, multi-step activities across a variety of environments.

Tool-space Interference: An emerging problem for LLM agents

Tool-space interference occurs when adding an otherwise reasonable agent or tool to a team or agent reduces end-to-end task performance. We study the phenomenon in an analysis of 1,470 MCP servers and make practical suggestions for MCP client, server, and marketplace developers.


Transcript

Tool-space Interference: An emerging problem for LLM agents

[MUSIC]

[MUSIC FADES INTO SWEEPING SOUND]

KAREN EASTERBROOK: As we progress in all directions of research at MSR [Microsoft Research], we stay true to a core part of our mission: advancing AI responsibly by understanding not just what these systems can do but how and why they sometimes fail.

Tyler Payne, a senior research software engineer with Microsoft Research AI Frontiers in New York City, is investigating how AI agents perform when they’re given access to multiple tools—from calculators to code interpreters. Surprisingly, his findings show that adding more tools can sometimes hurt performance, introducing “tool-space interference.”

Over to you, Tyler.

[MUSIC]

[MUSIC FADES INTO SWEEPING SOUND]

TYLER PAYNE: Hi, my name’s Tyler, and I’m a research engineer at Microsoft Research AI Frontiers.

Today, I’m going to be talking about an emerging problem for LLM agents that we call tool-space interference. This was an exploration done over the summer of 2025 in collaboration with my colleagues here at AI Frontiers.

AI agents powered by LLMs have become a popular topic in both research and industry. In general, an agent is a system that can sense and affect its environment in pursuit of a goal. LLM agents are usually software systems that equip LLMs with tools they can use to understand and manipulate their environment to complete tasks on behalf of their users. Often these agents act in computer environments, where they can browse the web, write code, and manipulate the file system.
For example, Magentic-One is a popular generalist agent developed by my collaborators here at MSR. It is designed as a multi-agent system, which is a useful programming abstraction that delegates certain capabilities to subagents. Specifically, in Magentic-One, these subagents are the Coder, Terminal, Web Surfer, and File Surfer, all of which are coordinated by a top-level Orchestrator agent.
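To make the basic pattern concrete, here is a minimal sketch of a single-agent tool-calling loop written with the OpenAI Python SDK. This is an illustration only, not Magentic-One's implementation; the `run_shell` tool and the model name are assumptions made for the example.

```python
# Minimal sketch of an LLM agent's tool-calling loop (illustrative only;
# not Magentic-One's implementation). Assumes the OpenAI Python SDK and a
# hypothetical `run_shell` tool.
import json
import subprocess
from openai import OpenAI

client = OpenAI()

def run_shell(command: str) -> str:
    """Hypothetical tool: run a shell command and return its output."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout + result.stderr

TOOLS = [{
    "type": "function",
    "function": {
        "name": "run_shell",
        "description": "Run a shell command and return its output.",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
}]

def agent(task: str, model: str = "gpt-4o") -> str:
    messages = [{"role": "user", "content": task}]
    while True:
        reply = client.chat.completions.create(
            model=model, messages=messages, tools=TOOLS
        ).choices[0].message
        if not reply.tool_calls:
            return reply.content  # no more tool calls: the agent is done
        messages.append(reply)
        for call in reply.tool_calls:
            args = json.loads(call.function.arguments)
            output = run_shell(**args)  # sense/affect the environment
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": output,
            })
```

Magentic-One layers an Orchestrator on top of several such agents and lets it decide, turn by turn, which subagent should act next.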

Now let’s imagine you ask Magentic-One to solve a git-related task. First, the Orchestrator must decide whether to delegate that task to the Terminal agent or the Web Surfer agent. Because Magentic-One is built and maintained as a single system, we can evaluate its behavior on such tasks and fix issues by adjusting any part of the system.

So, for example, we can provide in-context examples to the Orchestrator if it decides to delegate to the wrong subagent. Likewise, we can adjust the tools and prompts of these subagents directly. In this way, Magentic-One is a vertically integrated system.

But in the past year, the Model Context Protocol, or MCP, has exploded in popularity. MCP enables developers to bundle their tools into a server that can be easily shared and consumed by LLM agents. Most popular LLM agents like Claude Code, Cursor, and GitHub Copilot already support MCP servers. This lets any user extend their agent at runtime, breaking the assumptions of vertical integration.
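To show how little is involved, here is a rough sketch of an MCP server exposing a single tool, following the quickstart pattern of the official `mcp` Python SDK (FastMCP); the server name and the `git_status` tool are invented for the example.

```python
# Rough sketch of an MCP server, following the quickstart pattern of the
# official `mcp` Python SDK (FastMCP). The tool itself is a toy example.
import subprocess
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-git-tools")

@mcp.tool()
def git_status(repo_path: str) -> str:
    """Return short-form `git status` output for the given repository path."""
    result = subprocess.run(
        ["git", "-C", repo_path, "status", "--short"],
        capture_output=True, text=True,
    )
    return result.stdout or result.stderr

if __name__ == "__main__":
    # Serves over stdio by default, so an MCP-capable agent can launch
    # and connect to it at runtime.
    mcp.run()
```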

Now while this horizontal extensibility is exciting in principle, in practice, we observe that it can actually reduce LLM agents’ performance. We call this phenomenon tool-space interference.

In order to study tool-space interference, we developed MCP Interviewer, a CLI tool that automatically analyzes MCP servers, collecting descriptive statistics like the number of tools they provide, the depth and length of those tools’ schemas, and many other features. It can also use an LLM to generate a functional test plan that invokes each of the server’s tools to check that they behave as expected, and it can perform qualitative LLM-as-a-judge evaluation of the server.
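The sketch below illustrates the kind of descriptive statistics the MCP Interviewer collects; it is not the tool itself, and it assumes the client API of the official `mcp` Python SDK plus a hypothetical local stdio server.

```python
# Rough sketch of collecting descriptive statistics from an MCP server
# (illustrative only; this is NOT the MCP Interviewer). Assumes the client
# API of the official `mcp` Python SDK and a local stdio server.
import asyncio
import json
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

SERVER = StdioServerParameters(command="python", args=["server.py"])  # assumption

def schema_depth(schema, depth=1):
    """Maximum nesting depth of a JSON Schema's properties."""
    props = schema.get("properties", {}) if isinstance(schema, dict) else {}
    return max([schema_depth(p, depth + 1) for p in props.values()], default=depth)

async def main():
    async with stdio_client(SERVER) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = (await session.list_tools()).tools
            print(f"tool count: {len(tools)}")
            for tool in tools:
                schema = tool.inputSchema or {}
                print(
                    f"{tool.name}: "
                    f"description={len(tool.description or '')} chars, "
                    f"schema={len(json.dumps(schema))} chars, "
                    f"depth={schema_depth(schema)}"
                )

asyncio.run(main())
```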

We’re excited that MSR enables us to share these tools with the world, and we’ve open sourced the MCP Interviewer on GitHub.

Back to the research. We collected nearly 1,500 real MCP servers from public registries, including Smithery.ai and Docker MCP Hub. We then ran the MCP Interviewer on each of these servers and analyzed the results, which we lay out in detail in our blog post on the MSR blog.

To recap our main findings, we identified a few common issues that can cause tool-space interference. The first is tool name collisions. Two tools cannot have the same name, and LLM provider APIs will reject requests if there are name collisions between tools. MCP provides no formal guidance on namespacing, so clients have each had to develop their own strategies, like prefixing the server name before the tool name. Beyond exact collisions, tool names can also have significant semantic overlap, like “search,” “web_search,” “bing_search,” and “google_search.” This can also confuse agents.
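Here is a small sketch of one such client-side strategy: prefixing tool names with their server’s name, plus a rough check for exact collisions and near-duplicate names. The server and tool names are invented for illustration.

```python
# Sketch of one client-side namespacing strategy: prefix each tool name with
# its server's name, and flag exact collisions plus near-duplicate names.
# Server/tool names here are invented for illustration.
from collections import Counter

def namespace(server_name: str, tool_name: str) -> str:
    return f"{server_name}__{tool_name}"  # e.g. "bing__search"

def find_collisions(tool_names: list[str]) -> list[str]:
    return [name for name, n in Counter(tool_names).items() if n > 1]

def near_duplicates(tool_names: list[str]) -> list[tuple[str, str]]:
    # Very rough semantic-overlap heuristic: one name contained in another
    # after stripping separators ("search" vs "web_search" vs "bing_search").
    normalized = {t: t.lower().replace("_", "").replace("-", "") for t in tool_names}
    pairs = []
    for i, a in enumerate(tool_names):
        for b in tool_names[i + 1:]:
            if normalized[a] in normalized[b] or normalized[b] in normalized[a]:
                pairs.append((a, b))
    return pairs

tools = ["search", "web_search", "bing_search", "google_search", "search"]
print(find_collisions(tools))   # ['search']
print(near_duplicates(tools))   # e.g. ('search', 'web_search'), ...
```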

Next, we identified servers that expose too many tools. OpenAI’s API accepts a maximum of 128 tools, and their documentation recommends using far fewer, ideally under 20. But we observe many servers above this 20-tool threshold.

Long contexts can also degrade LLM tool-calling performance, and MCP provides no limit on the length of tool responses. We identified some tools that returned more than 128,000 tokens in a single response, overflowing the available context of models like GPT-4o and reducing the number of possible tool calls for other long-context models like Gemini.
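One way a client can defend against this, sketched below, is to count the tokens in each tool response and truncate anything oversized before it reaches the model. The tokenizer choice and the 16,000-token budget are assumptions for the example.

```python
# Sketch of a client-side guard that truncates oversized tool responses
# before they are appended to the model's context. The 16,000-token budget
# is an arbitrary assumption for illustration.
import tiktoken

ENC = tiktoken.get_encoding("o200k_base")  # GPT-4o's tokenizer
MAX_TOOL_RESPONSE_TOKENS = 16_000          # assumed budget

def clip_tool_response(text: str) -> str:
    tokens = ENC.encode(text)
    if len(tokens) <= MAX_TOOL_RESPONSE_TOKENS:
        return text
    kept = ENC.decode(tokens[:MAX_TOOL_RESPONSE_TOKENS])
    return kept + f"\n[... truncated {len(tokens) - MAX_TOOL_RESPONSE_TOKENS} tokens ...]"
```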

Finally, different models need to be prompted differently. For example, OpenAI recommends providing in-context examples of tool calls for chat completion models but discourages them for reasoning models. An MCP server generally does not know what model is connected to its client, and so its tool descriptions may work better for some models than others.
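A client that knows which model it is driving can adapt accordingly. The sketch below conditionally includes few-shot tool-call examples for chat completion models and omits them for reasoning models; the model-name check and the example text are rough assumptions.

```python
# Sketch of adapting tool-calling prompts per model family: include in-context
# tool-call examples for chat completion models, omit them for reasoning
# models. The model-name prefixes are a rough assumption.
FEW_SHOT_TOOL_EXAMPLE = (
    'Example: to list files, call run_shell with {"command": "ls -la"}.'
)

def build_system_prompt(model: str, base_prompt: str) -> str:
    is_reasoning_model = model.startswith(("o1", "o3", "o4"))  # assumption
    if is_reasoning_model:
        return base_prompt  # reasoning models: no few-shot tool examples
    return base_prompt + "\n\n" + FEW_SHOT_TOOL_EXAMPLE
```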

So what can you do?

As a user of MCP servers, you can use the MCP Interviewer tool to test servers before using them. As the developer of an MCP client, you can intercept long tool responses before submitting them to your LLM provider. As an MCP server developer, you should expose as few tools as possible, keep tool responses short, give tools unique and descriptive names, and report which models and clients you tested your server with. MCP marketplaces should also test uploaded servers, report their findings, and even reject servers that fail to meet minimum criteria, for example by exceeding a maximum tool count.

To learn more, please read our blog post and check out the MCP Interviewer on GitHub.