In this study, we introduce adaptive KV cache compression, a plug-and-play
method that reduces the memory footprint of generative inference for Large
Language Models (LLMs). Unlike the conventional KV cache, which retains key and
value vectors for all context tokens, our approach uses targeted profiling to
discern the intrinsic structure of attention modules. Based on the recognized
structure, we then construct the KV cache in an adaptive manner: evicting
long-range contexts on attention heads emphasizing local contexts, discarding
non-special tokens on attention heads centered on special tokens, and only
employing the standard KV cache for attention heads that broadly attend to all
tokens. Moreover, because only lightweight attention profiling is needed to
guide the construction of the adaptive KV cache, the method, called FastGen, can
be deployed without resource-intensive fine-tuning or re-training. In our
experiments across various tasks, FastGen demonstrates a substantial reduction
in GPU memory consumption with negligible loss in generation quality. We will
release our code and the compatible CUDA kernel for reproducibility.
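
The per-head cache construction described above can be illustrated with a minimal sketch. The three policies (local, special-token, full) come from the abstract; everything else is an assumption made for illustration, including the names `kept_positions` and `choose_policy`, the `window` and `threshold` parameters, and the selection rule (pick the cheapest policy that still covers a chosen fraction of one head's attention mass). It is not the released FastGen code.

```python
from typing import List, Set


def kept_positions(policy: str, seq_len: int, special: Set[int],
                   window: int = 4) -> List[int]:
    """Return token positions whose key/value vectors stay in the cache."""
    if policy == "local":
        # Head emphasizes local context: evict long-range tokens.
        return list(range(max(0, seq_len - window), seq_len))
    if policy == "special":
        # Head centers on special tokens: discard non-special tokens.
        return sorted(p for p in special if 0 <= p < seq_len)
    if policy == "full":
        # Head attends broadly: keep the standard, uncompressed cache.
        return list(range(seq_len))
    raise ValueError(f"unknown policy: {policy}")


def choose_policy(attn: List[float], special: Set[int],
                  window: int = 4, threshold: float = 0.9) -> str:
    """Hypothetical profiling rule: smallest cache covering `threshold` mass.

    `attn` holds one head's attention weights over all context tokens.
    """
    seq_len = len(attn)
    total = sum(attn)
    if total <= 0:
        return "full"
    candidates = []
    for policy in ("special", "local"):
        kept = kept_positions(policy, seq_len, special, window)
        mass = sum(attn[p] for p in kept) / total
        if mass >= threshold:
            candidates.append((len(kept), policy))
    # Fall back to the full cache when no compressed policy is good enough.
    return min(candidates)[1] if candidates else "full"


if __name__ == "__main__":
    local_head = [0.01] * 6 + [0.24, 0.3, 0.2, 0.2]
    policy = choose_policy(local_head, special={0})
    print(policy, kept_positions(policy, len(local_head), {0}))
```

In this sketch a head falls back to the full cache whenever neither compressed policy covers enough attention mass, which mirrors the abstract's statement that the standard KV cache is used only for heads that attend broadly to all tokens.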

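We introduce adaptive KV cache compression, a plug-and-play method that reduces the memory footprint of generative inference for large language models (LLMs). Through targeted profiling of the intrinsic structure of attention modules, we construct the KV cache adaptively: attention heads that emphasize local contexts evict long-range contexts, attention heads centered on special tokens discard non-special tokens, and only attention heads that attend broadly to all tokens use the standard KV cache. Moreover, because lightweight attention profiling guides the construction of the adaptive KV cache, FastGen requires no resource-intensive fine-tuning or re-training. In experiments across a variety of scenarios, FastGen substantially reduces GPU memory consumption with almost no loss in generation quality. We will release the code and a compatible CUDA kernel for reproducibility.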

Model-Guided Discarding: Adaptive KV Cache Compression for Large Language Models

Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs

Generative Pre-trained Transformer models, known as GPT or OPT, set
themselves apart not only through breakthrough performance across complex
language modelling tasks, but also through their extremely high computational
and storage costs. Specifically, due to their massive size, even inference for large,
highly-accurate GPT models may require multiple performant GPUs, which limits
the usability of such models. While there is emerging work on relieving this
pressure via model compression, the applicability and performance of existing
compression techniques are limited by the scale and complexity of GPT models. In
this paper, we address this challenge, and propose GPTQ, a new one-shot weight
quantization method based on approximate second-order information, that is both
highly-accurate and highly-efficient. Specifically, GPTQ can quantize GPT
models with 175 billion parameters in approximately four GPU hours, reducing
the bitwidth down to 3 or 4 bits per weight, with negligible accuracy
degradation relative to the uncompressed baseline. Our method more than doubles
the compression gains relative to previously proposed one-shot quantization
methods while preserving accuracy, allowing us for the first time to execute a
175-billion-parameter model on a single GPU for generative inference. Moreover,
we also show that our method can still provide reasonable accuracy in the
extreme quantization regime, in which weights are quantized to 2-bit or even
ternary quantization levels. We show experimentally that these improvements can
be leveraged for end-to-end inference speedups over FP16, of around 3.25x when
using high-end GPUs (NVIDIA A100) and 4.5x when using more cost-effective ones
(NVIDIA A6000). The implementation is available at
this https URL
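
A rough back-of-the-envelope calculation shows why the bitwidth matters for fitting a 175-billion-parameter model on one GPU. The sketch below counts weight storage only; it assumes 1 GB = 1e9 bytes and ignores quantization metadata (scales, zero points), activations, and the KV cache. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    for bits in (16, 4, 3):
        gb = weight_memory_gb(175e9, bits)
        print(f"{bits:>2}-bit weights: {gb:.1f} GB")
    # prints:
    # 16-bit weights: 350.0 GB
    #  4-bit weights: 87.5 GB
    #  3-bit weights: 65.6 GB
```

Under these assumptions, FP16 weights alone would need several GPUs, while 3-bit weights come in below the 80 GB of device memory available on the larger NVIDIA A100 variant.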

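This work proposes GPTQ, a new one-shot quantization method that can quantize GPT models with 175 billion parameters in approximately four GPU hours, using only 3 or 4 bits per weight while retaining accuracy close to that of the uncompressed baseline. It makes it possible to run a 175-billion-parameter model on a single GPU and delivers end-to-end inference speedups of roughly 3.25x to 4.5x over FP16.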