System-level innovation for inference at scale
We reimagine the AI inference stack to be workload-aware, cost-aware, and resilient at a global scale. Our research explores innovative resource allocation, request scheduling, batching, routing, and KV caching techniques, which directly benefit Microsoft’s inference…
Efficient AI
Reimagining AI efficiency from GPU kernels to context engineering to power Copilot-scale intelligence.
Tell me when: Building agents that can wait, monitor, and act
SentinelStep enables AI agents to handle monitoring tasks that run for hours or days, like watching for emails or tracking prices. It works by managing when agents should check and what context they retain, avoiding wasted resources…
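The blurb above does not spell out SentinelStep's scheduling policy, but the general idea of deciding when a long-running agent should next check can be illustrated with a simple capped-backoff polling loop. The function and parameter names below are hypothetical, not from SentinelStep:

```python
import time

def next_interval(interval, hit, min_interval=60, max_interval=3600, backoff=2.0):
    """Hypothetical cadence rule: reset to the minimum interval after a hit,
    otherwise grow the wait by `backoff`, capped at `max_interval`."""
    return min_interval if hit else min(interval * backoff, max_interval)

def monitor(check, act, min_interval=60, max_interval=3600, backoff=2.0):
    """Poll `check` until it returns a truthy result, then call `act` on it.

    Long idle stretches become cheap because the interval between checks
    grows geometrically while nothing is happening.
    """
    interval = min_interval
    while True:
        result = check()
        if result:
            act(result)
        interval = next_interval(interval, bool(result),
                                 min_interval, max_interval, backoff)
        time.sleep(interval)
```

This is only a sketch of the resource-saving intuition; the actual system also has to manage what context the agent carries between checks, which a bare polling loop does not capture.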
Distant conversational speech recognition: Challenges and opportunities
State-of-the-art ASR systems excel on close-talk benchmarks but struggle with far-field conversational speech, where error rates remain above 20%. Current benchmark datasets inadequately assess generalization across domains and real-world conditions, often relying on oracle segmentation…