{"id":1030206,"date":"2024-05-08T09:00:00","date_gmt":"2024-05-08T16:00:00","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=1030206"},"modified":"2024-05-08T10:52:37","modified_gmt":"2024-05-08T17:52:37","slug":"llm-profiling-guides-kv-cache-optimization","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/llm-profiling-guides-kv-cache-optimization\/","title":{"rendered":"LLM profiling guides KV cache optimization"},"content":{"rendered":"\n<p class=\"has-text-align-center\"><strong><em>This research paper was presented at the <\/em><\/strong><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/iclr.cc\/\" target=\"_blank\" rel=\"noopener noreferrer\"><strong><em>12<sup>th<\/sup> International Conference on Learning Representations<\/em><\/strong><span class=\"sr-only\"> (opens in new tab)<\/span><\/a><strong><em> (ICLR 2024), the premier conference dedicated to the advancement of deep learning.<\/em><\/strong><\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1401\" height=\"788\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/05\/ModelDiscard-BlogHeroFeature-1400x788-1.png\" alt=\"White ICLR logo to the left of the first page of the accepted paper, \u201cModel Tells You What to Discard: Adaptive KV Cache Compression for LLMs\u201d on a purple background.\" class=\"wp-image-1030227\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/05\/ModelDiscard-BlogHeroFeature-1400x788-1.png 1401w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/05\/ModelDiscard-BlogHeroFeature-1400x788-1-300x169.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/05\/ModelDiscard-BlogHeroFeature-1400x788-1-1024x576.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/05\/ModelDiscard-BlogHeroFeature-1400x788-1-768x432.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/05\/ModelDiscard-BlogHeroFeature-1400x788-1-1066x600.png 1066w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/05\/ModelDiscard-BlogHeroFeature-1400x788-1-655x368.png 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/05\/ModelDiscard-BlogHeroFeature-1400x788-1-240x135.png 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/05\/ModelDiscard-BlogHeroFeature-1400x788-1-640x360.png 640w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/05\/ModelDiscard-BlogHeroFeature-1400x788-1-960x540.png 960w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/05\/ModelDiscard-BlogHeroFeature-1400x788-1-1280x720.png 1280w\" sizes=\"auto, (max-width: 1401px) 100vw, 1401px\" \/><\/figure>\n\n\n\n<p>Large language models (LLMs) rely on complex internal mechanisms that require more memory than what is typically available to operate on standard devices. One such mechanism is the key-value (KV) cache, which stores and retrieves previously computed data, helping the model generate responses quickly without needing to recalculate information it has already processed. This method uses a substantial amount of memory because it keeps a large amount of this data readily accessible to enhance the model&#8217;s speed and efficiency. 
In our paper, "[Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs](https://www.microsoft.com/en-us/research/publication/model-tells-you-what-to-discard-adaptive-kv-cache-compression-for-llms/)," presented at ICLR 2024, we describe how FastGen optimizes the way LLMs store and access data, potentially cutting memory use by half while preserving their efficiency. This approach represents a significant step toward making sophisticated AI tools more accessible and affordable for broader applications. We are delighted to share that this paper received an Honorable Mention for the [Outstanding Paper Award](https://blog.iclr.cc/2024/05/06/iclr-2024-outstanding-paper-awards/).

## Observations of the KV cache

The development of FastGen is underpinned by our observations of how the KV cache functions. We first observed that not all the data in the KV cache is needed for LLMs to complete their tasks, as shown in Figure 1. By giving the KV cache a mechanism to discard unnecessary data, we can cut memory use significantly. For example, some LLM modules don't require broad contexts to process input; for these, we can construct a KV cache that removes less important long-range context, such as distant sentences or paragraphs. Other modules attend primarily to special tokens, such as punctuation, for which we can create a KV cache that retains only those tokens. Finally, some modules broadly need all tokens, and for these we can employ the standard KV cache that stores everything.

Another key observation is that attention modules in different layers and positions in the LLM behave differently and call for different KV cache policies, as shown on the right in Figure 1. These per-module structures can be written as simple rules over cached tokens, as sketched below.

Figure 1: These graphs depict the different structures of the KV cache. The graph on the left contains common structures. The circle graphs on the right contain compositions of three modules that are in the same layer, but the way they store data is different.
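To make the three structures concrete, here is a hedged sketch of how each could be expressed as a rule over cached token positions. The function names, the window size, and the `is_special` flags are illustrative stand-ins, not the paper's exact formulation:

```python
from typing import List

def keep_local(positions: List[int], current: int, window: int = 128) -> List[int]:
    """Locality structure: keep only a recent window, discarding
    less important long-range context."""
    return [p for p in positions if current - p < window]

def keep_special(positions: List[int], is_special: List[bool]) -> List[int]:
    """Special-token structure: retain only the tokens (e.g., punctuation)
    that this module primarily attends to."""
    return [p for p in positions if is_special[p]]

def keep_full(positions: List[int]) -> List[int]:
    """Standard KV cache: store every token."""
    return list(positions)

# Toy usage over 10 cached positions, with positions 0 and 5 marked special.
positions = list(range(10))
special = [p in (0, 5) for p in positions]
print(keep_local(positions, current=9, window=4))   # [6, 7, 8, 9]
print(keep_special(positions, special))             # [0, 5]
print(keep_full(positions))                         # all ten positions
```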
## FastGen accounts for the diversity of KV cache structures

Because different KV caches have different structures, they need to be handled differently. We built the FastGen algorithm on the observations above, enabling it to categorize and optimize the data stored in a given KV cache. FastGen first analyzes the specific behaviors of different modules to understand their structures, a method called *profiling*. It then uses the results to adjust how data is stored in real time, making the process more efficient. Our tests show that FastGen can reduce memory use by 50% without sacrificing quality. Additional experiments, discussed in detail in our [paper](https://www.microsoft.com/en-us/research/publication/model-tells-you-what-to-discard-adaptive-kv-cache-compression-for-llms/), confirm that the profiling step is crucial and significantly improves the efficiency of the KV cache.
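As a rough illustration of this profile-then-adapt idea (a simplification, not the paper's actual algorithm), one could profile a head by checking how much of its attention mass each candidate policy would retain, then pick the cheapest policy that clears a target threshold. All names, the policy set, and the 0.95 threshold below are assumptions for illustration:

```python
import numpy as np

def recovered_mass(attn: np.ndarray, kept: list) -> float:
    """Fraction of this head's attention weight falling on retained positions."""
    return float(attn[kept].sum() / attn.sum())

def profile_head(attn: np.ndarray, is_special: np.ndarray,
                 window: int = 4, target: float = 0.95) -> str:
    """Pick the cheapest cache policy whose retained entries recover at
    least `target` of the head's attention mass over the prompt."""
    n = len(attn)
    candidates = [                   # ordered roughly cheapest-first
        ("special_only", np.flatnonzero(is_special).tolist()),
        ("local_window", list(range(max(0, n - window), n))),
        ("full",         list(range(n))),   # always recovers everything
    ]
    for name, kept in candidates:
        if kept and recovered_mass(attn, kept) >= target:
            return name
    return "full"

# Toy profile: a head that concentrates its attention on one special token.
rng = np.random.default_rng(1)
attn = rng.random(16) * 0.001
attn[0] = 1.0                        # heavy weight on position 0
is_special = np.zeros(16, dtype=bool)
is_special[0] = True
print(profile_head(attn, is_special))   # -> "special_only"
```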
## The broader picture

Fueled by unprecedented advances in data handling and computational capabilities, LLM pretraining has emerged as a cornerstone of deep learning, transforming natural language processing tasks and continuously challenging our understanding of learning and cognition.

However, greater capabilities can bring challenges. As models scale up, customizing them for specific tasks becomes more resource-intensive. At Microsoft Research, we are exploring different approaches to more efficient model editing. One critical strategy is targeted model profiling, which identifies the components of a model essential to predefined goals. This profiling informs precise model modifications, optimizing resource use and effectiveness.

The two research projects we are presenting at ICLR 2024 support these goals, and both adopt the profile-then-edit paradigm to address different problems. FastGen reduces memory consumption, while our related work, [Post-hoc Attention Steering for LLMs (PASTA)](https://www.microsoft.com/en-us/research/publication/tell-your-model-where-to-attend-post-hoc-attention-steering-for-llms/), focuses on better controllability. Both approaches are designed to be resource-efficient, requiring neither tuning nor backpropagation. Looking ahead, our goal is to further develop these techniques to improve the resource efficiency of LLM applications, making them more accessible to a wider audience.