{"id":995736,"date":"2024-01-04T09:02:45","date_gmt":"2024-01-04T17:02:45","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=995736"},"modified":"2024-01-08T20:39:23","modified_gmt":"2024-01-09T04:39:23","slug":"splitwise-improves-gpu-usage-by-splitting-llm-inference-phases","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/splitwise-improves-gpu-usage-by-splitting-llm-inference-phases\/","title":{"rendered":"Splitwise improves GPU usage by splitting LLM inference phases"},"content":{"rendered":"\n<p>The recent surge in large language model (LLM) use is causing significant challenges for cloud providers, requiring them to deploy more GPUs at an unprecedented rate. However, the capacity to provision the power needed to run these GPUs is limited, and with demand for computation surpassing supply, it is not uncommon for user queries to be denied. Therefore, any approach to making the existing infrastructure more efficient\u2014enabling it to serve more queries faster under the same power budget\u2014can have very tangible benefits to both cloud providers and users.<\/p>\n\n\n\n<p>One aspect of LLM inference that currently limits efficient use of resources is that it has two distinct phases with different characteristics: the prompt phase and the token-generation phase. During the prompt phase, LLMs process all user input, or prompts, in parallel, efficiently utilizing GPU compute. However, during the token-generation phase, LLMs generate each output token sequentially and are limited by GPU memory bandwidth. Even when employing state-of-the-art batching mechanisms, the discrepancy between these two phases results in low overall hardware utilization, leading to much higher costs when offering LLMs to users. 
Figure 1 illustrates the differences between these two phases.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1400\" height=\"788\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/01\/NEWSplitwise-Jan-24-BlogHeroFeature-1400x788-1.jpg\" alt=\"An example of the generative LLM inference process and the two phases associated with it. The initial prompt is \u201cWhich is better, pizza or burger?\u201d and it generates the word \u201cPizza\u201d. The token generation phase generates the words\/tokens: \u201cis\u201d, \u201cbetter\u201d, and \u201c.\u201d. The prompt phase has the following properties: (1) all input tokens are processed in parallel to generate the first output token, (2) compute intensive, and (3) is a smaller part of the end-to-end latency. The token phase is: (1) serialized, (2) memory intensive, and (3) tends to be the majority of the end-to-end latency.\" class=\"wp-image-996810\" style=\"width:600px\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/01\/NEWSplitwise-Jan-24-BlogHeroFeature-1400x788-1.jpg 1400w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/01\/NEWSplitwise-Jan-24-BlogHeroFeature-1400x788-1-300x169.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/01\/NEWSplitwise-Jan-24-BlogHeroFeature-1400x788-1-1024x576.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/01\/NEWSplitwise-Jan-24-BlogHeroFeature-1400x788-1-768x432.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/01\/NEWSplitwise-Jan-24-BlogHeroFeature-1400x788-1-1066x600.jpg 1066w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/01\/NEWSplitwise-Jan-24-BlogHeroFeature-1400x788-1-655x368.jpg 655w, 
https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/01\/NEWSplitwise-Jan-24-BlogHeroFeature-1400x788-1-343x193.jpg 343w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/01\/NEWSplitwise-Jan-24-BlogHeroFeature-1400x788-1-240x135.jpg 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/01\/NEWSplitwise-Jan-24-BlogHeroFeature-1400x788-1-640x360.jpg 640w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/01\/NEWSplitwise-Jan-24-BlogHeroFeature-1400x788-1-960x540.jpg 960w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/01\/NEWSplitwise-Jan-24-BlogHeroFeature-1400x788-1-1280x720.jpg 1280w\" sizes=\"auto, (max-width: 1400px) 100vw, 1400px\" \/><figcaption class=\"wp-element-caption\">Figure 1. An example of the generative LLM inference process and the two phases associated with it. The prompt phase is computationally intensive, while the token phase is memory intensive. <\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"splitting-the-phases-with-splitwise\">Splitting the phases with Splitwise<\/h2>\n\n\n\n<p>At <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/aka.ms\/azrs\">Azure Research \u2013 Systems<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, we tackled this by creating Splitwise, a technique designed to optimally utilize available hardware by separating the prompt computation and token-generation phases onto separate machines. This approach is underpinned by the insight that prompt processing and token-generation are distinct in their computational, memory, and power requirements. By separating these two phases, we can enhance hardware utilization during both phases. 
Our paper, \u201c<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/splitwise-efficient-generative-llm-inference-using-phase-splitting\/\" target=\"_blank\" rel=\"noreferrer noopener\">Splitwise: Efficient Generative LLM Inference Using Phase Splitting<\/a>,\u201d details our methods for developing and testing this technique, including an exploration of how different types of GPUs perform during each phase.<\/p>\n\n\n\n<p>To create a sustainable approach for GPU provisioning, we used Splitwise to design GPU clusters with three primary objectives: maximizing throughput, minimizing costs, and reducing power. In addition to separating the two LLM inference phases into two distinct machine pools, we included a third machine pool for mixed batching across the prompt and token phases, sized dynamically based on real-time computational demands. Lastly, we transferred the state context (i.e., the KV-cache in the LLM transformer attention layers) from the prompt to the token machines over InfiniBand without any perceptible latency impact on the user. This high-level system architecture is illustrated in Figure 2.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1400\" height=\"788\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/01\/SplitwiseFIG2.jpg\" alt=\"A high-level diagram of the Splitwise architecture. Machines maintained in different pools are dedicated to the corresponding phases. The mixed pool grows and shrinks according to runtime demand. KV-cache encompassing the state of the query after the prompt phase is transferred from the prompt machines to the token machines over InfiniBand with very low latency. 
\" class=\"wp-image-997953\" style=\"width:600px\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/01\/SplitwiseFIG2.jpg 1400w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/01\/SplitwiseFIG2-300x169.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/01\/SplitwiseFIG2-1024x576.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/01\/SplitwiseFIG2-768x432.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/01\/SplitwiseFIG2-1066x600.jpg 1066w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/01\/SplitwiseFIG2-655x368.jpg 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/01\/SplitwiseFIG2-343x193.jpg 343w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/01\/SplitwiseFIG2-240x135.jpg 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/01\/SplitwiseFIG2-640x360.jpg 640w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/01\/SplitwiseFIG2-960x540.jpg 960w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/01\/SplitwiseFIG2-1280x720.jpg 1280w\" sizes=\"auto, (max-width: 1400px) 100vw, 1400px\" \/><figcaption class=\"wp-element-caption\">Figure 2. A high-level diagram of the Splitwise architecture. Machines maintained in different pools are dedicated to the two distinct LLM inference phases. The mixed pool grows and reduces according to runtime demand. KV-cache encompassing the state of the query after the prompt phase is transferred from the prompt machines to the token machines over InfiniBand with very low latency. 
<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"tests-show-splitwise-maximizes-throughput-while-lowering-costs\">Tests show Splitwise maximizes throughput while lowering costs<\/h2>\n\n\n\n<p>To evaluate its performance, we used Splitwise to design clusters with different types of GPUs, including NVIDIA DGX-A100 and DGX-H100, while optimizing cost, power, and throughput under specific latency service level agreements (SLAs) for each query. Table 1 shows the machine types we used for each cluster design. 
Our application of Splitwise encompassed two use cases: code and conversation using the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/meta-llama\/Llama-2-70b\" target=\"_blank\" rel=\"noopener noreferrer\">Llama-2-70B<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/arxiv.org\/abs\/2211.05100\" target=\"_blank\" rel=\"noopener noreferrer\">BLOOM-176B<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> LLMs.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"2086\" height=\"361\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/01\/splitwise-blog-table-hi.png\" alt=\"Details for the prompt and token machines we used for each cluster design, evaluated with Splitwise. All values are normalized to a baseline of DGX-A100. DGX-H100 capped is a system with all GPUs power-capped to half the maximum power. 
\" class=\"wp-image-996774\" style=\"object-fit:cover;width:900px;height:auto\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/01\/splitwise-blog-table-hi.png 2086w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/01\/splitwise-blog-table-hi-300x52.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/01\/splitwise-blog-table-hi-1024x177.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/01\/splitwise-blog-table-hi-768x133.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/01\/splitwise-blog-table-hi-1536x266.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/01\/splitwise-blog-table-hi-2048x354.png 2048w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/01\/splitwise-blog-table-hi-240x42.png 240w\" sizes=\"auto, (max-width: 2086px) 100vw, 2086px\" \/><figcaption class=\"wp-element-caption\">Table 1. Details for the prompt and token machines we used for each cluster design, evaluated with Splitwise. All values are normalized to a baseline of DGX-A100. DGX-H100 capped is a system with all GPUs power-capped to half the maximum power. <\/figcaption><\/figure>\n\n\n\n<p>Our findings demonstrate that Splitwise successfully achieves our three goals of maximizing throughput, minimizing costs, and reducing power. Through our evaluation, we observed that the Splitwise cluster design can maximize throughput at the same cost compared with an A100 baseline cluster. Moreover, Splitwise delivers much higher throughput while operating within the same provisioned power constraints as the baseline cluster. Figure 3 shows that compared with Baseline-H100, we can achieve 1.4x higher throughput at 20 percent lower cost. 
Alternatively, we can achieve 2.35x more throughput with the same cost and power budgets.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1200\" height=\"627\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/01\/Splitwise_Figure3.jpg\" alt=\"Results from baseline and Splitwise clusters optimized for throughput, all with the same power constraints. Splitwise-HH requires the least number of machines. Splitwise-HHcap provides the best throughput. Splitwise-AA is the cheapest option.\" class=\"wp-image-997947\" style=\"width:674px;height:auto\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/01\/Splitwise_Figure3.jpg 1200w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/01\/Splitwise_Figure3-300x157.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/01\/Splitwise_Figure3-1024x535.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/01\/Splitwise_Figure3-768x401.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/01\/Splitwise_Figure3-240x125.jpg 240w\" sizes=\"auto, (max-width: 1200px) 100vw, 1200px\" \/><figcaption class=\"wp-element-caption\">Figure 3. Results from baseline and Splitwise clusters optimized for throughput, all with the same power constraints.<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"looking-forward\">Looking forward<\/h2>\n\n\n\n<p>Splitwise marks a leap toward efficient, high-performance LLM deployments. By separating the prompt and token phases, we can unlock new potential in GPU use. 
Looking forward, we at Microsoft Azure envision tailored machine pools driving maximum throughput, reduced costs, and power efficiency, and we will continue to focus on making LLM inference efficient and sustainable.<\/p>\n\n\n\n<p>Our approach is now part of <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/github.com\/vllm-project\/vllm\" target=\"_blank\" rel=\"noopener noreferrer\">vLLM<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and can also be implemented with other frameworks.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"acknowledgements\">Acknowledgements<\/h2>\n\n\n\n<p>This work was done in collaboration with our intern, Pratyush Patel from the University of Washington. We also appreciate the help and guidance of Suriya Kalivardhan, Gopi Kumar, and <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/chetanb\/\" target=\"_blank\" rel=\"noreferrer noopener\">Chetan Bansal<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p> Expanded LLM use creates new demands on cloud GPU capacity. 
Splitwise presents an efficient solution by separating the two essential phases of LLM inference, achieving higher throughput within a limited power budget.<\/p>\n","protected":false},"author":42735,"featured_media":996810,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":[{"type":"user_nicename","value":"Esha Choukse","user_id":"40417"},{"type":"user_nicename","value":"Chaojie Zhang","user_id":"42705"},{"type":"user_nicename","value":"\u00cd\u00f1igo Goiri","user_id":"32102"},{"type":"user_nicename","value":"Aashaka Shah","user_id":"43056"},{"type":"user_nicename","value":"Saeed Maleki","user_id":"36131"},{"type":"user_nicename","value":"Rodrigo Fonseca","user_id":"40429"},{"type":"user_nicename","value":"Ricardo Bianchini","user_id":"33393"}],"msr_hide_image_in_river":0,"footnotes":""},"categories":[1],"tags":[],"research-area":[13556],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[243984],"msr-impact-theme":[264846],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-995736","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-research-blog","msr-research-area-artificial-intelligence","msr-locale-en_us","msr-post-option-blog-homepage-featured"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[],"msr_impact_theme":["Computing foundations"],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[],"related-projects":[],"related-events":[],"related-researchers":[{"type":"user_nicename","value":"Esha Choukse","user_id":40417,"display_name":"Esha Choukse","author_link":"<a 
href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/eschouks\/\" aria-label=\"Visit the profile page for Esha Choukse\">Esha Choukse<\/a>","is_active":false,"last_first":"Choukse, Esha","people_section":0,"alias":"eschouks"},{"type":"user_nicename","value":"Chaojie Zhang","user_id":42705,"display_name":"Chaojie Zhang","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/chaojiezhang\/\" aria-label=\"Visit the profile page for Chaojie Zhang\">Chaojie Zhang<\/a>","is_active":false,"last_first":"Zhang, Chaojie","people_section":0,"alias":"chaojiezhang"},{"type":"user_nicename","value":"\u00cd\u00f1igo Goiri","user_id":32102,"display_name":"&Iacute;&ntilde;igo Goiri","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/inigog\/\" aria-label=\"Visit the profile page for &Iacute;&ntilde;igo Goiri\">&Iacute;&ntilde;igo Goiri<\/a>","is_active":false,"last_first":"Goiri, \u00cd\u00f1igo","people_section":0,"alias":"inigog"},{"type":"user_nicename","value":"Aashaka Shah","user_id":43056,"display_name":"Aashaka Shah","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/aashakashah\/\" aria-label=\"Visit the profile page for Aashaka Shah\">Aashaka Shah<\/a>","is_active":false,"last_first":"Shah, Aashaka","people_section":0,"alias":"aashakashah"},{"type":"user_nicename","value":"Rodrigo Fonseca","user_id":40429,"display_name":"Rodrigo Fonseca","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/rofons\/\" aria-label=\"Visit the profile page for Rodrigo Fonseca\">Rodrigo Fonseca<\/a>","is_active":false,"last_first":"Fonseca, Rodrigo","people_section":0,"alias":"rofons"},{"type":"user_nicename","value":"Ricardo Bianchini","user_id":33393,"display_name":"Ricardo Bianchini","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/ricardob\/\" aria-label=\"Visit the profile page for Ricardo Bianchini\">Ricardo 
Bianchini<\/a>","is_active":false,"last_first":"Bianchini, Ricardo","people_section":0,"alias":"ricardob"}],"msr_type":"Post","featured_image_thumbnail":"<img width=\"960\" height=\"540\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/01\/NEWSplitwise-Jan-24-BlogHeroFeature-1400x788-1-960x540.jpg\" class=\"img-object-cover\" alt=\"An example of the generative LLM inference process and the two phases associated with it. The initial prompt is \u201cWhich is better, pizza or burger?\u201d and it generates the word \u201cPizza\u201d. The token generation phase generates the words\/tokens: \u201cis\u201d, \u201cbetter\u201d, and \u201c.\u201d. The prompt phase has the following properties: (1) all input tokens are processed in parallel to generate the first output token, (2) compute intensive, and (3) is a smaller part of the end-to-end latency. The token phase is: (1) serialized, (2) memory intensive, and (3) tends to be the majority of the end-to-end latency.\" decoding=\"async\" loading=\"lazy\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/01\/NEWSplitwise-Jan-24-BlogHeroFeature-1400x788-1-960x540.jpg 960w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/01\/NEWSplitwise-Jan-24-BlogHeroFeature-1400x788-1-300x169.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/01\/NEWSplitwise-Jan-24-BlogHeroFeature-1400x788-1-1024x576.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/01\/NEWSplitwise-Jan-24-BlogHeroFeature-1400x788-1-768x432.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/01\/NEWSplitwise-Jan-24-BlogHeroFeature-1400x788-1-1066x600.jpg 1066w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/01\/NEWSplitwise-Jan-24-BlogHeroFeature-1400x788-1-655x368.jpg 655w, 
https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/01\/NEWSplitwise-Jan-24-BlogHeroFeature-1400x788-1-343x193.jpg 343w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/01\/NEWSplitwise-Jan-24-BlogHeroFeature-1400x788-1-240x135.jpg 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/01\/NEWSplitwise-Jan-24-BlogHeroFeature-1400x788-1-640x360.jpg 640w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/01\/NEWSplitwise-Jan-24-BlogHeroFeature-1400x788-1-1280x720.jpg 1280w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/01\/NEWSplitwise-Jan-24-BlogHeroFeature-1400x788-1.jpg 1400w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/>","byline":"","formattedDate":"January 4, 2024","formattedExcerpt":"Expanded LLM use creates new demands on cloud GPU capacity. Splitwise presents an efficient solution by separating the two essential phases of LLM inference, achieving higher throughput within a limited power 
budget.","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/995736","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/42735"}],"replies":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/comments?post=995736"}],"version-history":[{"count":35,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/995736\/revisions"}],"predecessor-version":[{"id":997956,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/995736\/revisions\/997956"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/996810"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=995736"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/categories?post=995736"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/tags?post=995736"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=995736"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=995736"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=995736"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-l
ocale?post=995736"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=995736"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=995736"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=995736"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=995736"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}