Large language models (LLMs) with billions of parameters demonstrate
impressive performance. However, the widely used Multi-Head Attention (MHA) in
LLMs incurs substantial computational and memory costs during inference. While
some efforts have optimized attention mechanisms by pruning heads or sharing
parameters among heads, these methods often degrade performance or
require substantial continued pre-training to restore it.
Based on an analysis of attention redundancy, we design a Decoupled-Head
Attention (DHA) mechanism. DHA adaptively configures group sharing for key
heads and value heads across various layers, achieving a better balance between
performance and efficiency. Motivated by the observation that similar heads
form clusters, we progressively transform an MHA checkpoint into a DHA
model by linearly fusing the parameters of similar heads, which retains
the parametric knowledge of the MHA checkpoint. We construct DHA models by
transforming MHA checkpoints of various scales under given target head budgets. Our
experiments show that DHA requires only 0.25\% of the original
model's pre-training budget to achieve 97.6\% of its performance while saving 75\%
of the KV cache. Compared to Grouped-Query Attention (GQA), DHA achieves a 5$\times$
training acceleration, a maximum performance improvement of 13.93\% under a
0.01\% pre-training budget, and a 4\% relative improvement under a 0.05\%
pre-training budget.
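
The two ideas above, linear fusion of similar heads and per-layer decoupled key/value head budgets, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names `fuse_heads` and `kv_cache_elements`, the fixed fusion weights, and the per-layer head allocations are hypothetical and are not taken from the DHA implementation, where the fusion is carried out progressively during training rather than with fixed coefficients.

```python
def fuse_heads(heads, weights):
    """Linearly combine same-shaped head matrices into one shared head.

    `heads` is a list of matrices (lists of rows); `weights` holds one
    coefficient per head. Fixed weights are an illustrative assumption.
    """
    rows, cols = len(heads[0]), len(heads[0][0])
    fused = [[0.0] * cols for _ in range(rows)]
    for w, head in zip(weights, heads):
        for i in range(rows):
            for j in range(cols):
                fused[i][j] += w * head[i][j]
    return fused


def kv_cache_elements(key_heads, value_heads, head_dim, seq_len):
    """Count cached key/value scalars for one sequence.

    `key_heads` and `value_heads` list the head counts per layer, so the
    two can differ within a layer and vary across layers.
    """
    return sum((k + v) * head_dim * seq_len
               for k, v in zip(key_heads, value_heads))


# Two toy 1x2 "heads" averaged into a single shared head.
shared = fuse_heads([[[1.0, 2.0]], [[3.0, 4.0]]], [0.5, 0.5])

# Hypothetical 4-layer model: MHA with 32 key and 32 value heads per layer,
# versus a decoupled allocation that keeps 25% of the heads overall.
mha = kv_cache_elements([32] * 4, [32] * 4, head_dim=128, seq_len=1024)
dha = kv_cache_elements([4, 8, 8, 12], [12, 8, 8, 4], head_dim=128, seq_len=1024)
saving = 1 - dha / mha
```

In this toy allocation the key and value head counts differ within each layer but sum to a quarter of the MHA total, which corresponds to the 75\% KV cache saving quoted above.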

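Through an analysis of attention redundancy, this work designs Decoupled-Head Attention (DHA), a mechanism that achieves a better balance between performance and efficiency. A Multi-Head Attention (MHA) model is converted into a DHA model by progressively and linearly fusing the parameters of similar heads, which greatly reduces the pre-training budget while retaining high performance.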