Efficient long-context inference is critical as large language models (LLMs) adopt context windows ranging from 128K to 1M tokens. However, the growing key-value (KV) cache and the high computational complexity of attention create significant bottlenecks in memory usage and latency. In this paper, we find that attention in diverse long-context tasks exhibits sparsity, and that LLMs implicitly "know" which tokens can be dropped or evicted at the head level after the prefilling stage. Based on this insight, we propose Self-Attention Guided Eviction~(SAGE-KV), a simple and effective KV cache eviction method for long-context inference. After prefilling, our method performs a one-time top-k selection at both the token and head levels to compress the KV cache, so that subsequent decoding runs on the reduced cache. Evaluations on LongBench with three long-context LLMs (Llama3.1-8B-Instruct-128k, Llama3-8B-Prolong-512k-Instruct, and Qwen2.5-7B-Instruct-128k) show that SAGE-KV maintains accuracy comparable to full attention while significantly improving efficiency. Specifically, SAGE-KV achieves 4x higher memory efficiency with improved accuracy over the static KV cache selection method StreamLLM, and 2x higher memory efficiency with better accuracy than the dynamic KV cache selection method Quest.
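
As a rough illustration of the one-time, per-head top-k selection described above, the following minimal sketch compresses a toy KV cache with NumPy. It is not the authors' implementation: the function name `select_topk_kv`, the tensor layout, and the use of the last prompt token's query as the scoring signal are all assumptions made for illustration, since the abstract does not specify these details.

```python
import numpy as np

def select_topk_kv(keys, values, query, k):
    """Keep the k highest-scoring cached tokens independently for each head.

    Hypothetical layout (an assumption, not taken from the paper):
      keys, values: (num_heads, num_tokens, head_dim)
      query:        (num_heads, head_dim), e.g. the last prompt token's query
    """
    num_heads, num_tokens, head_dim = keys.shape
    k = min(k, num_tokens)

    # Scaled dot-product score of the query against every cached key, per head.
    # Softmax is monotonic, so ranking raw scores matches ranking attention weights.
    scores = np.einsum("hd,htd->ht", query, keys) / np.sqrt(head_dim)

    # Indices of the k largest scores in each head (returned unordered).
    top = np.argpartition(scores, -k, axis=-1)[:, -k:]
    # Restore the original token order within each head.
    top = np.sort(top, axis=-1)

    # Gather the selected rows; the index broadcasts over head_dim.
    idx = top[:, :, None]
    kept_keys = np.take_along_axis(keys, idx, axis=1)
    kept_values = np.take_along_axis(values, idx, axis=1)
    return kept_keys, kept_values, top


# Toy usage: 2 heads, 16 cached tokens, head dimension 4, keep 5 tokens per head.
rng = np.random.default_rng(0)
keys = rng.normal(size=(2, 16, 4))
values = rng.normal(size=(2, 16, 4))
query = rng.normal(size=(2, 4))
kept_keys, kept_values, kept_idx = select_topk_kv(keys, values, query, k=5)
```

Because the selection runs once after prefilling, decoding in this sketch would attend only to `kept_keys` and `kept_values`. Each head keeps its own index set, which is one plausible reading of selection "at the head level".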

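This work addresses the memory and latency bottlenecks that the key-value cache and the computational complexity of attention impose on long-context inference with large language models. Building on the observation that attention is sparse in long-context tasks, it proposes a new self-attention guided cache eviction method (SAGE-KV) that significantly improves memory efficiency while maintaining accuracy comparable to full attention. Experiments show that SAGE-KV achieves 4x higher memory efficiency with better accuracy across multiple long-context models.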

LLMs Know What to Drop: Self-Attention Guided KV Cache Eviction for Efficient Long-Context Inference