Large language models (LLMs) have been applied across a wide range of tasks due to their astonishing capabilities. With advances in techniques such as chain-of-thought (CoT) prompting and in-context learning (ICL), the prompts fed to LLMs are becoming increasingly long, sometimes exceeding tens of thousands of tokens. To accelerate model inference and reduce cost, this paper presents LLMLingua, a coarse-to-fine prompt compression method that involves a budget controller to maintain semantic integrity under high compression ratios, a token-level iterative compression algorithm to better model the interdependence between compressed contents, and an instruction-tuning-based method for distribution alignment between language models. We conduct experiments and analysis on four datasets from different scenarios, namely GSM8K, BBH, ShareGPT, and Arxiv-March23, showing that the proposed approach yields state-of-the-art performance and allows for up to 20x compression with little performance loss. Our code is available at https://aka.ms/LLMLingua.

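LLMLingua is a coarse-to-fine prompt compression method. It combines a budget controller, a token-level iterative compression algorithm, and an instruction-tuning-based method for aligning the distributions of language models, which preserves semantic integrity at high compression ratios while accelerating model inference and reducing cost. Experiments and analysis on datasets from several different scenarios show that the method achieves state-of-the-art performance and supports up to 20x compression with little performance loss.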
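
The coarse-to-fine idea under a token budget can be illustrated with a toy sketch. This is not the LLMLingua implementation: the function name `compress_prompt`, the "keep about twice the budget" rule in the coarse step, and the use of unigram surprisal (computed from word counts in the prompt itself) as a stand-in for a language model score are all assumptions made for illustration. The sketch also omits the iterative, context-dependent scoring and the distribution alignment step that the abstract describes.

```python
import math
from collections import Counter


def compress_prompt(segments, ratio):
    """Toy coarse-to-fine compression: keep about total/ratio tokens.

    Hypothetical helper for illustration only; scoring uses unigram
    surprisal instead of a real language model.
    """
    token_lists = [seg.split() for seg in segments]
    total = sum(len(toks) for toks in token_lists)
    budget = max(1, int(total / ratio))  # target number of tokens to keep

    counts = Counter(tok for toks in token_lists for tok in toks)

    def surprisal(tok):
        # Rare tokens get a high score, frequent tokens a low one.
        return -math.log(counts[tok] / total)

    def mean_surprisal(i):
        toks = token_lists[i]
        return sum(map(surprisal, toks)) / max(1, len(toks))

    # Coarse step: keep the highest-scoring segments until roughly
    # twice the budget is covered (an arbitrary illustrative margin).
    ranked = sorted(range(len(token_lists)), key=lambda i: -mean_surprisal(i))
    kept, size = [], 0
    for i in ranked:
        if size >= 2 * budget:
            break
        kept.append(i)
        size += len(token_lists[i])

    # Fine step: within the kept segments, keep the `budget` highest-scoring
    # tokens and restore their original order.
    candidates = [
        (i, j, tok) for i in sorted(kept) for j, tok in enumerate(token_lists[i])
    ]
    top = sorted(candidates, key=lambda c: -surprisal(c[2]))[:budget]
    return [tok for _, _, tok in sorted(top, key=lambda c: (c[0], c[1]))]
```

Called with `ratio=20`, the sketch keeps about one token in twenty, which mirrors the "up to 20x" figure only in terms of length; it says nothing about how much task performance such a naive scorer would preserve.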

LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models