{"id":1150288,"date":"2025-10-22T10:22:38","date_gmt":"2025-10-22T17:22:38","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-project&#038;p=1150288"},"modified":"2025-11-07T05:24:36","modified_gmt":"2025-11-07T13:24:36","slug":"system%e2%80%91level-innovation-for-inference-at-scale","status":"publish","type":"msr-project","link":"https:\/\/www.microsoft.com\/en-us\/research\/project\/system%e2%80%91level-innovation-for-inference-at-scale\/","title":{"rendered":"System\u2011level innovation for inference at scale\u00a0"},"content":{"rendered":"<section class=\"mb-3 moray-highlight\">\n\t<div class=\"card-img-overlay mx-lg-0\">\n\t\t<div class=\"card-background  has-background- card-background--full-bleed\">\n\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1920\" height=\"720\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/06\/M365-Research-Page-Banner_1920x720.jpg\" class=\"attachment-full size-full\" alt=\"M365 Research banner: network of connected points\" style=\"\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/06\/M365-Research-Page-Banner_1920x720.jpg 1920w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/06\/M365-Research-Page-Banner_1920x720-300x113.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/06\/M365-Research-Page-Banner_1920x720-1024x384.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/06\/M365-Research-Page-Banner_1920x720-768x288.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/06\/M365-Research-Page-Banner_1920x720-1536x576.jpg 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/06\/M365-Research-Page-Banner_1920x720-1600x600.jpg 1600w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/06\/M365-Research-Page-Banner_1920x720-240x90.jpg 240w\" sizes=\"auto, (max-width: 1920px) 100vw, 1920px\" \/>\t\t<\/div>\n\t\t<!-- Foreground -->\n\t\t<div class=\"card-foreground d-flex mt-md-n5 my-lg-5 px-g px-lg-0\">\n\t\t\t<!-- Container -->\n\t\t\t<div class=\"container d-flex mt-md-n5 my-lg-5 \">\n\t\t\t\t<!-- Card wrapper -->\n\t\t\t\t<div class=\"w-100 \">\n\t\t\t\t\t<!-- Card -->\n\t\t\t\t\t<div class=\"card material-md-card py-5 px-md-5\">\n\t\t\t\t\t\t<div class=\"card-body \">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/group\/efficient-ai\/\" class=\"icon-link icon-link--reverse mb-2\" data-bi-cN=\"Efficient AI team\">\n\t\t\t\t\t\t\t\t\t<span class=\"c-glyph glyph-chevron-left\" aria-hidden=\"true\"><\/span>\n\t\t\t\t\t\t\t\t\tEfficient AI team\t\t\t\t\t\t\t\t<\/a>\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\n<h1 class=\"wp-block-heading\" id=\"system-level-innovation-for-inference-at-scale\">System\u2011level innovation for inference at scale&nbsp;<\/h1>\n\n\t\t\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t<\/div>\n\t\t<\/div>\n\t<\/div>\n<\/section>\n\n\n\n\n\n<p>We reimagine the AI inference stack to be workload-aware, cost-aware, and resilient at a global scale. Our research explores innovative resource allocation, request scheduling, batching, routing, and KV caching techniques, which directly benefit Microsoft&#8217;s inference infrastructure.<\/p>\n\n\n\n<p>Our goal is to bridge the gap between deployed AI models and underlying hardware through a holistic, full-stack approach. 
<figure>
<figcaption>Example of routing and scheduling strategies for LLM inference: <a href="https://arxiv.org/abs/2502.14617">SageServe</a>, our holistic system for serving LLM requests with a wide range of performance objectives by leveraging heterogeneity across the stack, and <a href="https://arxiv.org/pdf/2411.15997">FairServe</a>, our application-aware scheduler.</figcaption>
</figure>

<h2 id="why-it-matters">Why it matters</h2>

<p>This research provides the critical "glue" that connects AI workloads to Microsoft's GPU fleet. By deeply understanding every layer of the inference stack, from model architectures and workloads down to the underlying hardware, we enable a symbiotic relationship between software and hardware. This alignment ensures workloads fully exploit system-level optimizations, while our GPU infrastructure adapts intelligently to evolving demands. The result: a more efficient, cost-effective, and high-performance inference platform powering Microsoft's AI services at scale.</p>
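<p>To ground the KV caching direction mentioned above, here is a minimal, hypothetical sketch of prefix reuse: a request that shares a prompt prefix with an earlier request can skip recomputing the key/value entries for those tokens. The cache keying, the LRU eviction policy, and the token values are illustrative assumptions, not a description of any production serving system.</p>

<pre><code class="language-python">from collections import OrderedDict


class PrefixKvCache:
    """Toy prefix-reuse cache (hypothetical): tracks which prompt prefixes have
    cached KV state, with simple LRU eviction."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self._entries = OrderedDict()  # maps prefix tuple -> cached prefix length

    def longest_cached_prefix(self, tokens):
        """Return how many leading tokens are already covered by a cached prefix."""
        best = 0
        for prefix, length in self._entries.items():
            if tuple(tokens[:length]) == prefix:
                best = max(best, length)
        if best:
            self._entries.move_to_end(tuple(tokens[:best]))  # refresh LRU recency
        return best

    def insert(self, tokens):
        """Record that KV state for this full token sequence is now cached."""
        key = tuple(tokens)
        self._entries[key] = len(tokens)
        self._entries.move_to_end(key)
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict the least recently used entry


# Example: a shared system prompt lets the second request skip most of its prefill.
cache = PrefixKvCache()
cache.insert([101, 102, 103])                               # system-prompt tokens (illustrative)
print(cache.longest_cached_prefix([101, 102, 103, 7, 8]))   # -> 3 tokens reusable
</code></pre>

<p>Real serving stacks typically hash fixed-size token blocks and keep the corresponding attention tensors in GPU memory; the toy version above only tracks which prefixes would be reusable.</p>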
<h2 id="related-people">Related people</h2>

<ul>
<li>Anjaly Parayil</li>
<li>Spyridon (Spyros) Mastorakis</li>
<li>Ankur Mallick</li>
<li>Victor Ruehle</li>
<li>Renee St. Amant</li>
<li>Srikant Bharadwaj</li>
</ul>