We reimagine the AI inference stack to be workload-aware, cost-aware, and resilient at global scale. Our research explores innovative techniques for resource allocation, request scheduling, batching, routing, and KV caching that directly benefit Microsoft’s inference infrastructure.
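To make one of these techniques concrete, the sketch below illustrates the idea behind prefix-based KV cache reuse: requests that share a prompt prefix (for example, a common system prompt) can reuse previously computed attention state instead of recomputing it. This is a minimal toy illustration; the class and names (PrefixKVCache, longest_cached_prefix) are hypothetical and do not reflect any specific Microsoft system, and production engines manage KV memory in far more sophisticated ways.

```python
# Illustrative sketch only: a toy prefix-reuse KV cache keyed on token prefixes.
# Names and structure are hypothetical, not an actual inference-engine API.
from typing import Dict, List, Optional, Tuple

TokenIds = Tuple[int, ...]

class PrefixKVCache:
    """Caches per-prefix attention state so shared prompt prefixes
    are not recomputed for every request."""

    def __init__(self) -> None:
        self._store: Dict[TokenIds, object] = {}

    def longest_cached_prefix(self, tokens: List[int]) -> Tuple[TokenIds, Optional[object]]:
        # Walk from the longest possible prefix down; return the first hit.
        for end in range(len(tokens), 0, -1):
            key = tuple(tokens[:end])
            if key in self._store:
                return key, self._store[key]
        return tuple(), None

    def insert(self, tokens: List[int], kv_state: object) -> None:
        self._store[tuple(tokens)] = kv_state


cache = PrefixKVCache()
cache.insert([1, 2, 3], kv_state="kv-for-[1,2,3]")            # placeholder KV state
prefix, state = cache.longest_cached_prefix([1, 2, 3, 4, 5])
print(f"reuse {len(prefix)} tokens, recompute {5 - len(prefix)}")  # reuse 3, recompute 2
```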
Our goal is to bridge the gap between deployed AI models and the underlying hardware through a holistic, full-stack approach. We leverage not only the diversity across workloads (e.g., agentic vs. non-agentic, stringent vs. relaxed latency requirements), model architectures, and hardware platforms, but also the unique characteristics of each layer of the stack. By tailoring optimizations to each layer’s strengths and constraints, we achieve higher throughput per GPU, reduced cost per inference, and more predictable latency.
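As a simple illustration of workload-aware tailoring, the sketch below routes requests to different GPU pools based on their latency budget and whether they are agentic. The pool names, thresholds, and fields are assumptions made for illustration only, not a description of Microsoft's actual routing policy.

```python
# Illustrative sketch only: a toy SLO-aware router.
# Pool names, latency classes, and thresholds are hypothetical.
from dataclasses import dataclass

@dataclass
class Request:
    request_id: str
    deadline_ms: int      # end-to-end latency budget
    is_agentic: bool      # multi-step, tool-calling workload vs. single turn

def route(req: Request) -> str:
    """Pick a GPU pool based on workload class and latency budget."""
    if req.deadline_ms <= 200:
        return "interactive-pool"   # small batches, latency-optimized
    if req.is_agentic:
        return "agentic-pool"       # keeps KV cache warm across agent steps
    return "batch-pool"             # large batches, throughput-optimized

print(route(Request("r1", deadline_ms=150, is_agentic=False)))   # interactive-pool
print(route(Request("r2", deadline_ms=5000, is_agentic=True)))   # agentic-pool
```

The point of the sketch is the design choice, not the code: requests with relaxed latency requirements can be batched aggressively for throughput, while latency-sensitive and agentic workloads are placed where their constraints (and cache locality) are best served.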

Why it matters
This research provides the critical “glue” that connects AI workloads to Microsoft’s GPU fleet. By deeply understanding every layer of the inference stack, from model architectures and workloads to the underlying hardware, we enable a symbiotic relationship between software and hardware. This alignment ensures that workloads fully exploit system-level optimizations, while our GPU infrastructure adapts intelligently to evolving demands. The result: a more efficient, cost-effective, and high-performance inference platform powering Microsoft’s AI services at scale.