Efficient AI team

System‑level innovation for inference at scale 

We reimagine the AI inference stack to be workload-aware, cost-aware, and resilient at a global scale. Our research explores innovative resource allocation, request scheduling, batching, routing, and KV caching techniques, which directly benefit Microsoft’s inference infrastructure.
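To make the routing idea concrete, here is a minimal sketch of workload-aware request routing: each request is sent to the GPU pool that matches its latency tier and currently has the shortest estimated queue. The pool model, field names, and scoring rule are hypothetical simplifications for illustration only, not a description of Microsoft's production routing logic.

```python
# Minimal sketch of latency-class-aware request routing. All names, fields,
# and the scoring rule are illustrative assumptions, not a real system's API.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class GpuPool:
    name: str
    latency_class: str          # e.g. "interactive" or "batch"
    tokens_per_sec: float       # advertised throughput headroom
    queued_tokens: int = 0      # work already assigned to this pool

    def estimated_wait(self) -> float:
        # Rough queueing-delay estimate: queued work divided by throughput.
        return self.queued_tokens / self.tokens_per_sec

@dataclass
class Request:
    request_id: str
    prompt_tokens: int
    latency_class: str          # requested SLO tier

def route(request: Request, pools: List[GpuPool]) -> Optional[GpuPool]:
    """Pick the matching-tier pool with the lowest estimated wait,
    falling back to any pool if no tier matches."""
    candidates = [p for p in pools if p.latency_class == request.latency_class] or pools
    if not candidates:
        return None
    best = min(candidates, key=lambda p: p.estimated_wait())
    best.queued_tokens += request.prompt_tokens   # account for the new work
    return best

if __name__ == "__main__":
    pools = [
        GpuPool("pool-a", "interactive", tokens_per_sec=20_000),
        GpuPool("pool-b", "batch", tokens_per_sec=50_000, queued_tokens=400_000),
    ]
    r = Request("req-1", prompt_tokens=1_200, latency_class="interactive")
    chosen = route(r, pools)
    print(f"{r.request_id} -> {chosen.name} (est. wait {chosen.estimated_wait():.2f}s)")
```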

Our goal is to bridge the gap between deployed AI models and the underlying hardware through a holistic, full-stack approach. We leverage not only the diversity across workloads (e.g., agentic vs. non-agentic, stringent vs. relaxed latency requirements), model architectures, and hardware platforms, but also the unique characteristics of each layer of the stack. By tailoring optimizations to each layer's strengths and constraints, we achieve higher throughput per GPU, reduced cost per inference, and more predictable latency.
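As one concrete example of trading throughput per GPU against predictable latency, the sketch below grows a batch only while a simple cost model predicts that every admitted request still meets its latency budget. The linear cost model and its constants are assumptions for illustration, not measured numbers from any production system.

```python
# Illustrative sketch of deadline-aware batch admission under an assumed
# linear per-step latency model. Constants and the model are hypothetical.
from dataclasses import dataclass
from typing import List

@dataclass
class Pending:
    request_id: str
    deadline_ms: float          # remaining latency budget for the next step

def step_latency_ms(batch_size: int, base_ms: float = 8.0, per_req_ms: float = 1.5) -> float:
    # Assumed cost model: fixed per-step overhead plus a per-request term.
    return base_ms + per_req_ms * batch_size

def admit(queue: List[Pending]) -> List[Pending]:
    """Greedily grow the batch (tightest deadlines first) while the predicted
    step latency still fits every admitted request's budget."""
    batch: List[Pending] = []
    for req in sorted(queue, key=lambda r: r.deadline_ms):
        candidate = batch + [req]
        if all(step_latency_ms(len(candidate)) <= r.deadline_ms for r in candidate):
            batch = candidate
    return batch

if __name__ == "__main__":
    queue = [Pending("a", 12.0), Pending("b", 40.0), Pending("c", 15.0), Pending("d", 9.0)]
    chosen = admit(queue)
    print("admitted:", [r.request_id for r in chosen],
          "| predicted step:", step_latency_ms(len(chosen)), "ms")
```

A richer scheduler would also account for KV-cache occupancy and prefill vs. decode phases, which is exactly the kind of layer-specific information a full-stack approach can exploit.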

Examples of routing and scheduling strategies for LLM inference: SageServe, our holistic system for serving LLM requests with a wide range of performance objectives by leveraging heterogeneity across the stack, and FairServe, our application-aware scheduler.

Why it matters

This research provides the critical “glue” that connects AI workloads to Microsoft’s GPU fleet. By deeply understanding every layer of the inference stack, from model architectures and workloads down to the underlying hardware, we enable a symbiotic relationship between software and hardware. This alignment ensures workloads fully exploit system-level optimizations, while our GPU infrastructure adapts intelligently to evolving demands. The result: a more efficient, cost-effective, and high-performance inference platform powering Microsoft’s AI services at scale.