At Microsoft, our network engineers work across multiple systems, including topology views, telemetry dashboards, logs, incidents, tickets, and fragmented tools. They piece together signals from these sources to understand what’s happening during an incident, often under considerable time pressure.
But this kind of fragmentation slows down reasoning. Engineers spend more time navigating tools than diagnosing issues.
To address this, the Microsoft Infrastructure, Networking, and Tenant organization in Microsoft Digital, the company’s IT organization, is building Infrastructure Graph (IGraph), a unified platform that brings topology, real-time telemetry, and operational context into a single view.
On top of this foundation, agentic capabilities enable AI agents to reason across these signals, surfacing insights, explaining issues, and recommending next steps. This shifts the experience from exploring data to making decisions faster and with greater confidence.

This visualization layer and intelligence platform provides a view of our entire Microsoft enterprise network—including more than 20,000 on-premises devices across 900 sites worldwide—to instantly surface the most critical issues and offer proactive recommendations to our engineers.
“Engineers increasingly face fragmented visibility,” says Astha Sinha, a product manager in the Infrastructure, Networking, and Tenant team in Microsoft Digital. “We wanted to unify live telemetry, topology, and context into one single intelligent visualization experience and show engineers what’s really important, so they don’t have to dive into oceans of data.”
Network insight at speed
IGraph displays the following in a single pane-of-glass view for a given site:
- Topology and dependency context: Visualizes routers, switches, access points, client devices, and their relationships, enriched with path and dependency awareness to localize impact areas
- Real-time health and telemetry insights: Surfaces live performance signals (utilization, errors, abnormal behavior) correlated directly onto the topology to highlight where the network is degraded or “running hot”
- Operational and incident context: Integrates incidents, tickets, and change signals into the graph, enabling engineers to understand in a single view what is happening, where it is happening, and which systems are affected
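To make the three layers above concrete, here is a minimal sketch of how topology, telemetry, and incident context could sit on one graph. It is an illustration only, not IGraph's actual data model: the `Device` and `SiteGraph` classes, the field names, the thresholds, and the device and incident identifiers are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    """One node in the site graph (hypothetical schema)."""
    name: str
    kind: str                      # e.g. "router", "switch", "access_point"
    utilization: float = 0.0       # fraction of capacity, 0.0 to 1.0
    error_rate: float = 0.0        # errors per second
    incidents: list[str] = field(default_factory=list)


@dataclass
class SiteGraph:
    """Topology plus the telemetry and incidents attached to each device."""
    devices: dict[str, Device] = field(default_factory=dict)
    links: dict[str, set[str]] = field(default_factory=dict)

    def add_device(self, device: Device) -> None:
        self.devices[device.name] = device
        self.links.setdefault(device.name, set())

    def connect(self, a: str, b: str) -> None:
        # Links are undirected: each device lists the other as a neighbor.
        self.links[a].add(b)
        self.links[b].add(a)

    def neighbors(self, name: str) -> list[str]:
        return sorted(self.links[name])

    def running_hot(self, max_utilization: float = 0.8,
                    max_error_rate: float = 1.0) -> list[str]:
        # Thresholds are illustrative defaults, not values from the article.
        return sorted(
            name for name, d in self.devices.items()
            if d.utilization > max_utilization or d.error_rate > max_error_rate
        )


# Hypothetical three-device site.
site = SiteGraph()
site.add_device(Device("rtr-1", "router", utilization=0.35))
site.add_device(Device("sw-1", "switch", utilization=0.92, incidents=["INC-001"]))
site.add_device(Device("ap-1", "access_point", utilization=0.10))
site.connect("rtr-1", "sw-1")
site.connect("sw-1", "ap-1")

print(site.running_hot())       # ['sw-1']
print(site.neighbors("sw-1"))   # ['ap-1', 'rtr-1']
```

Keeping telemetry and incidents as attributes of the same node that carries the topology links is what lets a single query answer both "what is degraded?" and "what is it connected to?".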

On top of this visualization layer, the team is building an agentic layer using Azure Foundry that allows AI agents to discover and use external tools and data sources.
Without IGraph Agent, accessing data involves pulling from multiple existing sources, including servers and logs, with latency that ranges from minutes to hours. This fragmentation makes near-real-time reasoning almost impossible, because agents lack a unified, low-latency view of topology and telemetry.
“Fragmentation across operational data sources was only part of the problem,” says Vinod Kumar Singh, a principal software engineer in the Infrastructure, Networking, and Tenant team in Microsoft Digital. “The harder challenge was externalizing and structuring the implicit domain knowledge engineers rely on, then integrating it with real-time telemetry and topology to enable low-latency, context-aware reasoning in the agentic layer.”
How IGraph works
The user starts in context. Say they’re on the IGraph UI for Building 32. They can already see the building topology, recent incidents, support tickets, and live health and performance metrics.
The engineer can ask a natural language question such as, “The internet is not working in Building 32—what’s going on?”
The AI agent begins reasoning across UI context (location, devices, open incidents), topology (involved devices and neighbors), historical metrics, and real-time device calls. It works with specialized MCP servers and agents to identify impacted devices, test live responsiveness, measure neighboring impact, verify data flow, and flag abnormal utilization or error trends.

Using this context, IGraph pulls in the relevant logs, real-time telemetry, and incident history to complete the analysis.
Instead of raw metrics and hundreds of rows of data, the agent returns a clean summary that provides a view of the failing device, the health of neighboring devices, and the blast radius. It shows what’s broken, what’s still healthy, the likely causes, and next actions.
The engineer stays in one UI throughout and isn’t forced to switch between tools or manually correlate data.
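One way to picture the summary described above (failing device, neighbor health, blast radius) is as a reachability check on the topology. The sketch below is an assumption about how such a summary could be computed, not the agent's actual logic: it defines the blast radius as every device that can no longer reach a chosen upstream root once the failing device is removed. The function names and the sample topology are hypothetical.

```python
from collections import deque


def reachable(links: dict[str, set[str]], start: str, down: set[str]) -> set[str]:
    """Breadth-first search from `start`, skipping devices that are down."""
    if start in down:
        return set()
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in links.get(node, ()):
            if neighbor not in seen and neighbor not in down:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def summarize(links: dict[str, set[str]], root: str, failed: str) -> dict:
    """Summarize one failure: what is cut off and which neighbors still work."""
    still_up = reachable(links, root, {failed})
    return {
        "failing_device": failed,
        # Devices that lost their path to the root because of the failure.
        "blast_radius": sorted(set(links) - still_up - {failed}),
        # Direct neighbors of the failed device that can still reach the root.
        "healthy_neighbors": sorted(n for n in links[failed] if n in still_up),
    }


# Hypothetical building topology: one core, two distribution switches, three APs.
links = {
    "core": {"dist-1", "dist-2"},
    "dist-1": {"core", "ap-1", "ap-2"},
    "dist-2": {"core", "ap-3"},
    "ap-1": {"dist-1"},
    "ap-2": {"dist-1"},
    "ap-3": {"dist-2"},
}

summary = summarize(links, root="core", failed="dist-1")
print(summary)
# {'failing_device': 'dist-1', 'blast_radius': ['ap-1', 'ap-2'], 'healthy_neighbors': ['core']}
```

A real system would also weigh live telemetry and redundant paths; this sketch covers only the topological part of the answer.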
“Engineers spend a lot of time firefighting,” says Abhijit Vijay, a principal software engineer manager on the team in Microsoft Digital. “The visualization layer gives them the view they need to quickly solve the incidents. It helps free up their time to engage in more systemic improvements on their applications.”
The impact of incident visibility
IGraph offers a new real-time telemetry layer that:
- Uses a UI that surfaces telemetry and topology by correlating data from upstream systems
- Decreases effective latency for users, enabling near-real-time insights (often within seconds)
- Provides near-real-time signals in the UI on health, performance, routing state, and neighboring device relationships

Combined, these capabilities give network engineers an up-to-the-moment view of what’s happening across the network, before small issues can cascade into larger incidents.
By making live telemetry easier to access and interpret, IGraph helps teams move from reactive troubleshooting to proactive prevention.
“Our goal is to accelerate how network engineers understand what’s happening, enabling them to shift from reactive troubleshooting to proactive prevention—identifying and mitigating issues before they occur,” says Nevedita Mallick, a principal product manager for the Infrastructure, Networking, and Tenant team in Microsoft Digital.
That speed and clarity are especially important for new engineers.

Complex networks rely on unwritten knowledge and experience built up over time, which can slow onboarding and make troubleshooting harder than it needs to be. IGraph shortens that learning curve by making the network’s relationships and current state immediately visible.
“The tool delivers value right away, especially for newer engineers,” says Manjiri Keskar, a principal cloud network engineer in the Infrastructure, Networking, and Tenant team in Microsoft Digital. “Instead of having to piece things together, they get an instant view of the network that shows how devices are connected and displays the already-surfaced incidents directly on the graph.”
What’s next for IGraph Agent
Without IGraph Agent, network analysis is largely reactive.
Teams often address failures after customers have already felt the impact, instead of preventing issues by acting when early warning signs appear.

“Agentic AI is transforming networking DevOps from manual, reactive operations into intelligent, intent-driven systems that can provision, validate, and troubleshoot networks autonomously,” says Sonika Munde, a senior network engineer in the Infrastructure, Networking, and Tenant team in Microsoft Digital. “Looking ahead, it will power self-healing networks and dramatically accelerate buildouts, allowing engineers to focus on architecture, strategy, and innovation.”
That unified network intelligence will let IGraph Agent communicate with multiple lightweight agents that continuously analyze network conditions, dramatically compressing response times.
“What used to happen in hours will happen in minutes,” Munde says.
Now, the team is pushing further. One example is layering in weather intelligence to help engineers anticipate issues before they materialize, as big storms can trigger power fluctuations that ripple through the network. By visualizing this data, engineers can proactively communicate with customers and take mitigation steps that protect operational workloads.
Overall, IGraph lets teams focus on prevention. Engineers spend less time navigating dashboards and cross-checking data and more time detecting patterns and surfacing emerging risks. Manual analysis is reduced as the agent highlights insights in real time.

The technology is poised to go even further. IGraph will eventually help power self-healing networks and speed up network build-outs, freeing engineers to focus on architecture and innovation. The future vision for the tool includes fully automated predictive network intelligence across all Microsoft campuses, with agents that monitor, reason, recommend responses, and safely take action.
“By bringing telemetry, topology, and AI together in one intelligent layer, we’re turning fragmented signals into real-time intelligence so teams can move faster, act earlier, and protect the critical workloads that power Microsoft,” says Jason Thompson, a principal group product manager for the Infrastructure, Networking, and Tenant team in Microsoft Digital.

Key takeaways
To move from reactive operations to proactive AI-supported network management, we recommend starting with these steps:
- Start consolidating real-time telemetry into a single view. Even a lightweight dashboard is enough to prepare for AI-driven insights later.
- Identify high-frequency incident types to target for AI triage. Pick the most common or disruptive scenarios and map out what data engineers currently review for them.
- Document the decision logic your engineers use today. Before implementing AI, capture the human reasoning steps to help guide your approach.
- Pilot an agentic solution with one network segment or site. Start with one building, one lab, or a small testbed.
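As an illustration of the "document the decision logic" step above, the reasoning an engineer follows for one incident type can be written down as an ordered list of checks before any AI is involved. This is a minimal sketch under stated assumptions: the `TriageStep` structure, the signal names, the thresholds, and the playbook content are all hypothetical and would need to come from your own engineers.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class TriageStep:
    """One documented reasoning step: a question, a test, and a conclusion."""
    question: str
    check: Callable[[dict], bool]   # returns True when this step explains the issue
    conclusion: str


# Hypothetical playbook for a "Wi-Fi is down" report, in the order an engineer checks.
WIFI_OUTAGE_PLAYBOOK = [
    TriageStep(
        "Is the access point unreachable?",
        lambda s: not s["ap_reachable"],
        "Access point is down; check power and uplink.",
    ),
    TriageStep(
        "Is the uplink switch saturated?",
        lambda s: s["uplink_utilization"] > 0.9,
        "Uplink is saturated; look for traffic spikes.",
    ),
    TriageStep(
        "Was there a recent change at this site?",
        lambda s: s["recent_change"],
        "Review the latest change for this site.",
    ),
]


def triage(signals: dict, playbook: list[TriageStep]) -> str:
    """Walk the playbook in order and return the first matching conclusion."""
    for step in playbook:
        if step.check(signals):
            return step.conclusion
    return "No documented rule matched; escalate to an engineer."


signals = {"ap_reachable": True, "uplink_utilization": 0.95, "recent_change": False}
print(triage(signals, WIFI_OUTAGE_PLAYBOOK))
# Uplink is saturated; look for traffic spikes.
```

Writing the steps down in this form also shows which signals a pilot site must expose before an agent can follow the same reasoning.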


