The advent of pre-trained large language models (LLMs) has revolutionized
various natural language processing tasks. These models predominantly employ an
auto-regressive decoding mechanism that utilizes Key-Value (KV) caches to
eliminate redundant calculations for previous tokens. Nevertheless, as context
lengths and batch sizes increase, the memory footprint of KV caches grows
linearly and becomes a key bottleneck in LLM deployment, significantly
reducing generation speed. To mitigate this issue, techniques such as
multi-query attention (MQA) and grouped-query attention (GQA) reduce the
number of KV heads, accelerating inference while retaining accuracy
comparable to multi-head attention (MHA). Despite their effectiveness, existing
strategies for compressing MHA often overlook the intrinsic properties of the
KV caches. In this work, we explore the low-rank characteristics of the KV
caches and propose a novel approach for compressing KV heads. In particular, we
carefully optimize the MHA-to-GQA transformation to minimize compression error.
To remain compatible with rotary position embeddings (RoPE), we also
introduce specialized strategies for key caches with RoPE. We demonstrate that
our method can compress half or even three-quarters of KV heads while
maintaining performance comparable to the original LLMs, which presents a
promising direction for more efficient LLM deployment in resource-constrained
environments.
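
To make the linear growth of the KV cache concrete, the following minimal sketch computes the cache size with the standard accounting (one key tensor and one value tensor per layer). The model configuration used here (32 layers, 32 heads, head dimension 128, fp16 storage, context length 4096, batch size 8) is a hypothetical example and is not taken from this work; the function name `kv_cache_bytes` is likewise illustrative.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_elem=2):
    """Approximate KV-cache size in bytes.

    The leading factor 2 counts the key tensor and the value tensor
    stored for every layer. bytes_per_elem=2 assumes fp16/bf16 storage.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical configuration (an assumption, not from the paper).
LAYERS, HEADS, HEAD_DIM = 32, 32, 128
SEQ_LEN, BATCH = 4096, 8

# Keep all KV heads (MHA), half of them, or a quarter of them.
for kv_heads in (HEADS, HEADS // 2, HEADS // 4):
    size = kv_cache_bytes(LAYERS, kv_heads, HEAD_DIM, SEQ_LEN, BATCH)
    print(f"{kv_heads:2d} KV heads: {size / 2**30:.1f} GiB")
# 32 KV heads: 16.0 GiB
# 16 KV heads: 8.0 GiB
#  8 KV heads: 4.0 GiB
```

Under these assumptions the cache scales linearly with context length, batch size, and the number of KV heads, so removing half or three-quarters of the KV heads reduces the cache to one half or one quarter of its original size.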

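In this paper, we explore the low-rank characteristics of KV caches and propose a new method for compressing KV heads that minimizes compression error while maintaining performance comparable to the original LLMs, offering a promising direction for more efficient LLM deployment in resource-constrained environments.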