Multi-head Latent Attention (MLA) is an architecture proposed by DeepSeek, designed to ensure efficient and economical inference by compressing the Key-Value (KV) cache into a latent vector. Compared to MLA, standard LLMs employing Multi-Head Attention (MHA) and its variants such as Grouped-Query Attention (GQA) incur significantly higher inference costs. Enabling well-trained LLMs (e.g., Llama) to rapidly adapt to MLA without pre-training from scratch is both meaningful and challenging. This paper proposes the first data-efficient fine-tuning method for transitioning from MHA to MLA (MHA2MLA), which includes two key components: for partial-RoPE, we remove RoPE from the dimensions of queries and keys that contribute less to the attention scores; for low-rank approximation, we introduce joint SVD approximations based on the pre-trained parameters of keys and values. These strategies enable MHA2MLA to recover performance using only a small fraction (0.3% to 0.6%) of the data, significantly reducing inference costs while integrating seamlessly with compression techniques such as KV cache quantization. For example, the KV cache size of Llama2-7B is reduced by 92.19%, with only a 0.5% drop in LongBench performance.
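
To make the low-rank component concrete, the following is a minimal NumPy sketch of a joint SVD over concatenated key and value projection weights. It is an illustration under stated assumptions, not the authors' implementation: the function name `joint_svd_kv`, the argument layout, and the even split of singular values between the two factors are all hypothetical choices, and the partial-RoPE step (which would keep some key dimensions outside the factorization) is omitted.

```python
import numpy as np

def joint_svd_kv(W_k, W_v, rank):
    """Factor [W_k | W_v] through one shared latent of size `rank`.

    Hypothetical helper for illustration only.
    W_k: (d_model, d_k) pre-trained key projection.
    W_v: (d_model, d_v) pre-trained value projection.
    Returns W_down (d_model, rank), W_up_k (rank, d_k), W_up_v (rank, d_v).
    """
    d_k = W_k.shape[1]
    # Concatenate so keys and values share a single down-projection.
    W_kv = np.concatenate([W_k, W_v], axis=1)
    U, S, Vt = np.linalg.svd(W_kv, full_matrices=False)
    # Keep the top-`rank` components; splitting sqrt(S) across both
    # factors is an arbitrary (assumed) way to balance their scales.
    sqrt_S = np.sqrt(S[:rank])
    W_down = U[:, :rank] * sqrt_S
    W_up = sqrt_S[:, None] * Vt[:rank]
    return W_down, W_up[:, :d_k], W_up[:, d_k:]

# Toy example: only the latent x @ W_down would need to be cached.
rng = np.random.default_rng(0)
W_k = rng.standard_normal((8, 4))
W_v = rng.standard_normal((8, 4))
W_down, W_up_k, W_up_v = joint_svd_kv(W_k, W_v, rank=3)
x = rng.standard_normal((2, 8))   # two token hidden states
latent = x @ W_down               # shape (2, 3), cached instead of K and V
k_approx = latent @ W_up_k        # approximate keys, shape (2, 4)
v_approx = latent @ W_up_v        # approximate values, shape (2, 4)
```

In this sketch the cache stores `rank` numbers per token instead of `d_k + d_v`, and the truncation error is what fine-tuning would then need to recover.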

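This paper addresses the high inference cost of Multi-Head Attention (MHA) in deep learning models and proposes MHA2MLA, a data-efficient fine-tuning method for transitioning from MHA to DeepSeek's Multi-head Latent Attention (MLA). The results show that MHA2MLA recovers performance using only a small fraction (0.3% to 0.6%) of the data, while substantially reducing inference costs and significantly compressing the KV cache in practice.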

Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs