This research paper was presented at the 2023 Conference on Empirical Methods in Natural Language Processing (opens in new tab) (EMNLP 2023), the premier conference on natural language processing and artificial intelligence.
As large language models (LLMs) models advance and their potential becomes increasingly apparent, an understanding is emerging that the quality of their output is directly related to the nature of the prompt that is given to them. This has resulted in the rise of prompting technologies, such as chain-of-thought (CoT) and in-context-learning (ICL), which facilitate an increase in prompt length. In some instances, prompts now extend to tens of thousands of tokens, or units of text, and beyond. While longer prompts hold considerable potential, they also introduce a host of issues, such as the need to exceed the chat window’s maximum limit, a reduced capacity for retaining contextual information, and an increase in API costs, both in monetary terms and computational resources.
To address these challenges, we introduce a prompt-compression method in our paper, “LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models (opens in new tab),” presented at EMNLP 2023 (opens in new tab). Using a well-trained small language model, such as GPT2-small or LLaMA-7B, LLMLingua identifies and removes unimportant tokens from prompts. This compression technique enables closed LLMs to make inferences from the compressed prompt. Although the token-level compressed prompts may be difficult for humans to understand, they prove highly effective for LLMs. This is illustrated in Figure 1.
LLMLingua’s method and evaluation
To develop LLMLingua’s framework, we employed a budget controller to balance the sensitivities of different modules in the prompt, preserving the language’s integrity. Our two-stage process involved course-grained prompt compression. We first streamlined the prompt by eliminating certain sentences and then individually compressed the remaining tokens. To preserve coherence, we employed an iterative token-level compression approach, refining the individual relationships between tokens. Additionally, we fine-tuned the smaller model to capture the distribution information from different closed LLMs by aligning it with the patterns in the LLMs’ generated data. We did this through instruction tuning.
To assess LLMLingua’s performance, we tested compressed prompts on four different datasets, GSM8K, BBH, ShareGPT, and Arxiv-March23, encompassing ICL, reasoning, summarization, and conversation. Our approach achieved impressive results, achieving up to 20x compression while preserving the original prompt’s capabilities, particularly in ICL and reasoning. LLMLingua also significantly reduced system latency.
During our test, we used LLaMA-7B as the small language model and GPT-3.5-Turbo-0301, one of OpenAI’s LLMs, as the closed LLM. The results show that LLMLingua maintains the original reasoning, summarization, and dialogue capabilities of the prompt, even at a maximum compression ratio of 20x, as reflected in the evaluation metric (EM) columns in Tables 1 and 2. At the same time, other compression methods failed to retain key semantic information in prompts, especially in logical reasoning details. For a more in-depth discussion of these results, refer to section 5.2 of the paper.
LLMLingua is robust, cost-effective, efficient, and recoverable
LLMLingua also showed impressive results across various small language models and different closed LLMs. When using GPT-2-small, LLMLingua achieved a strong performance score of 76.27 under the ¼-shot constraint, close to the LLaMA-7B’s result of 77.33 and surpassing the standard prompt results of 74.9. Similarly, even without aligning Claude-v1.3, one of the post powerful LLMs, LLMLingua’s score was 82.61 under the ½-shot constraint, outperforming the standard prompt result of 81.8.
LLMLingua also proved effective in reducing response length, leading to significant reductions in latency in the LLM’s generation process, with reductions ranging between 20 to 30 percent, as shown in Figure 2.
What makes LLMLingua even more impressive is its recoverability feature. When we used GPT-4 to restore the compressed prompts, it successfully recovered all key reasoning information from the full nine-step chain-of-thought (CoT) prompting, which enables LLMs to address problems through sequential intermediate steps. The recovered prompt was almost identical to the original, and its meaning was retained. This is shown in Tables 3 and 4.
Enhancing the user experience and looking ahead
LLMLingua is already proving its value through practical application. It has been integrated into LlamaIndex (opens in new tab), a widely adopted retrieval-augmented generation (RAG) framework. Currently, we are collaborating with product teams to reduce the number of tokens required in LLM calls, particularly for tasks like multi-document question-answering. Here, our goal is to significantly improve the user experience with LLMs.
For the long-term, we have proposed LongLLMLingua, a prompt-compression technique designed for long-context scenarios, such as retrieval-augmented question-answering tasks in applications like chatbots, useful when information evolves dynamically over time. It’s also geared for tasks like summarizing online meetings. LongLLMLingua’s primary objective is to enhance LLMs’ ability to perceive key information, making it suitable for numerous real-world applications, notably information-based chatbots. We’re hopeful that this innovation paves the way for more sophisticated and user-friendly interactions with LLMs.
Learn more about our work on the LLMLingua (opens in new tab) page.