Sequence-to-sequence tasks often benefit from long contexts, but the quadratic complexity of self-attention in standard Transformers makes this non-trivial. During generation, temporary representations (stored in the so-called KV cache) account for a large portion of GPU memory usage and scale linearly with context length. We introduce KV-Distill, a Transformer compression framework that distills long-context KV caches into significantly shorter representations in a question-independent fashion. KV-Distill can be trained as a parameter-efficient adaptor for pretrained models, and enables the compression of arbitrary spans of a context while preserving pretrained model capabilities. We treat a compressed-uncompressed cache as a student-teacher pairing and apply a KL-type divergence to match the generated outputs. KV-Distill outperforms other compression techniques in worst-case extractive tasks and approaches uncompressed performance in long-context question answering and summarization, and it can be fine-tuned on domain-specific contexts to reduce lengths by up to 99% while preserving downstream performance. We demonstrate the generalizability of KV-Distill across various model sizes and architectures.
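
To make the student-teacher objective concrete, the following is a minimal sketch of a per-token KL loss between next-token distributions computed with the uncompressed cache (teacher) and with the compressed cache (student). It assumes a plain forward KL averaged over positions; the abstract says only "KL-type divergence", so the exact form used by KV-Distill may differ. The function names and the list-of-logits input format are hypothetical, chosen for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vector of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(teacher_logits, student_logits):
    """Forward KL(teacher || student) for a single token position.

    Assumption: the teacher is the model conditioned on the full KV cache,
    the student is the same model conditioned on the compressed cache.
    """
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def distillation_loss(teacher_seq, student_seq):
    """Mean per-token KL over a sequence of logit vectors (hypothetical helper)."""
    if len(teacher_seq) != len(student_seq):
        raise ValueError("teacher and student must cover the same positions")
    total = sum(kl_divergence(t, s) for t, s in zip(teacher_seq, student_seq))
    return total / len(teacher_seq)
```

The loss is zero when the compressed cache reproduces the teacher's output distribution exactly and grows as the two distributions diverge, which is the behavior a distillation objective of this kind relies on.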

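This work addresses the quadratic complexity of self-attention in standard Transformers by proposing KV-Distill, a framework that compresses long-context KV caches into significantly shorter representations while preserving pretrained model capabilities. Experiments show that KV-Distill outperforms other compression techniques on extractive tasks and, after fine-tuning on domain-specific contexts, can reduce context length by up to 99% while preserving downstream performance.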

KV-Distill: Nearly Lossless Learnable Context Compression for Large Language Models