The advent of pre-trained large language models (LLMs) has revolutionized
various natural language processing tasks. These models predominantly employ an
auto-regressive decoding mechanism that utilizes Key-Value (KV) caches to
eliminate redundant calculations for previous tokens. However, as context
lengths and batch sizes grow, the memory footprint of KV caches grows linearly
and becomes a key bottleneck in LLM deployment, significantly slowing
generation. To mitigate this issue, techniques such as multi-query attention
(MQA) and grouped-query attention (GQA) reduce the number of KV heads,
accelerating inference while keeping accuracy comparable to multi-head
attention (MHA). Despite their effectiveness, existing
strategies for compressing MHA often overlook the intrinsic properties of the
KV caches. In this work, we explore the low-rank characteristics of the KV
caches and propose a novel approach for compressing KV heads. In particular, we
carefully optimize the MHA-to-GQA transformation to minimize compression error,
and to remain compatible with rotary position embeddings (RoPE), we also
introduce specialized strategies for key caches with RoPE. We demonstrate that
our method can compress half or even three-quarters of KV heads while
maintaining performance comparable to the original LLMs, which presents a
promising direction for more efficient LLM deployment in resource-constrained
environments.
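
To make the linear growth of the KV cache concrete, the following minimal sketch estimates its size for a decoder-only model. The function name `kv_cache_bytes`, the 16-bit element size, and the example configuration (32 layers, 32 heads of dimension 128) are illustrative assumptions, not values taken from this work.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_value=2):
    """Estimate KV cache size in bytes (hypothetical helper).

    The leading factor of 2 counts one key tensor and one value tensor per
    layer. bytes_per_value=2 assumes 16-bit storage.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Assumed example configuration: 32 layers, 32 heads, head dimension 128,
# context length 4096, batch size 8.
full = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=8)
half = kv_cache_bytes(32, 16, 128, seq_len=4096, batch_size=8)
quarter = kv_cache_bytes(32, 8, 128, seq_len=4096, batch_size=8)

for name, size in [("all KV heads", full), ("half", half), ("quarter", quarter)]:
    print(f"{name}: {size / 2**30:.1f} GiB")
```

Under these assumptions the cache scales linearly with sequence length, batch size, and the number of KV heads, so keeping a quarter of the heads shrinks the cache to a quarter of its size.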

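In this paper, we explore the low-rank characteristics of KV caches and propose a new method for compressing KV heads that minimizes compression error while maintaining performance comparable to the original LLMs, offering a promising direction for more efficient LLM deployment in resource-constrained environments.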

Effectively Compress KV Heads for LLM

He and Hofmann (arXiv:2311.01906) detailed a skipless transformer without the
V and P (post-attention projection) linear layers, which reduces the total
number of weights. However, this scheme applies only to MHA (multi-head
attention), not to MQA (multi-query attention) or GQA (grouped-query
attention). The latter schemes are used by many popular LLMs such as Llama 2,
Mistral, Mixtral, PaLM, and Gemma. Therefore, this micro-paper proposes
mathematically equivalent versions that are suitable for MQA and GQA. For
example, removing Q and P from a skipless version of Mistral-7B would remove
15% of its weights (and thus reduce its compute and memory complexity). See
arXiv:2402.13388 and this https URL for
code and more transformer tricks.
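
The 15% figure can be checked with a rough weight count. The sketch below is only an estimate: the hyperparameters are assumed from the publicly released Mistral-7B configuration (they are not stated in the abstract), normalization weights are ignored, and the helper name `weight_counts` is hypothetical.

```python
# Assumed Mistral-7B hyperparameters (public configuration, not from the abstract).
D_MODEL = 4096      # model (embedding) dimension
N_LAYERS = 32       # number of transformer blocks
N_HEADS = 32        # query heads
N_KV_HEADS = 8      # key-value heads (GQA)
HEAD_DIM = 128      # dimension per head
FFN_HIDDEN = 14336  # feed-forward hidden dimension
VOCAB_SIZE = 32000  # vocabulary size


def weight_counts():
    """Return (total_weights, weights_in_Q_and_P), ignoring norm weights."""
    q = D_MODEL * N_HEADS * HEAD_DIM     # query projection
    k = D_MODEL * N_KV_HEADS * HEAD_DIM  # key projection
    v = D_MODEL * N_KV_HEADS * HEAD_DIM  # value projection
    p = N_HEADS * HEAD_DIM * D_MODEL     # post-attention projection
    ffn = 3 * D_MODEL * FFN_HIDDEN       # assumed gated FFN with three matrices
    embeddings = 2 * VOCAB_SIZE * D_MODEL  # untied input and output embeddings
    total = N_LAYERS * (q + k + v + p + ffn) + embeddings
    removed = N_LAYERS * (q + p)
    return total, removed


total, removed = weight_counts()
print(f"Q and P hold {removed / total:.1%} of about {total / 1e9:.2f}B weights")
```

With these assumed values, Q and P together account for a little under 15% of the weights, in line with the figure quoted above.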

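Using equivalent versions of the skipless transformer that are suitable for multi-query attention (MQA) and grouped-query attention (GQA) reduces its compute and memory complexity.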

Transformer tricks: Removing weights for skipless transformers

Multi-query attention (MQA), which only uses a single key-value head,
drastically speeds up decoder inference. However, MQA can lead to quality
degradation, and moreover it may not be desirable to train a separate model
just for faster inference. We (1) propose a recipe for uptraining existing
multi-head language model checkpoints into models with MQA using 5% of original
pre-training compute, and (2) introduce grouped-query attention (GQA), a
generalization of multi-query attention which uses an intermediate (more than
one, less than number of query heads) number of key-value heads. We show that
uptrained GQA achieves quality close to multi-head attention with comparable
speed to MQA.
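
The grouping idea can be sketched in a few lines. The helper names below are hypothetical, and mean pooling the original key or value heads within each group is shown only as one plausible way to initialize the grouped heads from a multi-head checkpoint; the abstract itself does not specify the conversion step.

```python
def kv_head_for_query_head(query_head, num_query_heads, num_kv_heads):
    """Map a query head index to the key-value head it shares (hypothetical helper)."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("num_query_heads must be divisible by num_kv_heads")
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size


def mean_pool_heads(heads, num_kv_heads):
    """Average consecutive groups of per-head weight vectors.

    `heads` is a list of equal-length lists of floats, one per original head.
    Returns `num_kv_heads` pooled vectors.
    """
    if len(heads) % num_kv_heads != 0:
        raise ValueError("number of heads must be divisible by num_kv_heads")
    group_size = len(heads) // num_kv_heads
    pooled = []
    for g in range(num_kv_heads):
        group = heads[g * group_size:(g + 1) * group_size]
        # zip(*group) walks the group column by column.
        pooled.append([sum(column) / group_size for column in zip(*group)])
    return pooled


# 8 query heads sharing 2 key-value heads: heads 0-3 use KV head 0, 4-7 use KV head 1.
mapping = [kv_head_for_query_head(h, 8, 2) for h in range(8)]
print(mapping)
```

Setting `num_kv_heads` to 1 recovers MQA (every query head shares one key-value head), and setting it equal to the number of query heads recovers MHA.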

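By using an intermediate number of key-value heads, we propose grouped-query attention (GQA), a generalization of multi-query attention (MQA) that balances inference speed against quality.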