Generative large language models (LLMs) have opened up numerous novel
possibilities, but due to their significant computational requirements their
ubiquitous use remains challenging. Some of the most useful applications
require processing large numbers of samples at a time and using long contexts,
both significantly increasing the memory communication load of the models. We
introduce SparQ Attention, a technique for increasing the inference throughput
of LLMs by reducing the memory bandwidth requirements within the attention
blocks through selective fetching of the cached history. Our proposed technique
can be applied directly to off-the-shelf LLMs during inference, without
requiring any modification to the pre-training setup or additional fine-tuning.
We show how SparQ Attention can decrease the attention memory bandwidth
requirements up to eight times without any loss in accuracy by evaluating Llama
2 and Pythia models on a wide range of downstream tasks.

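SparQ Attention increases the inference throughput of large language models by selectively fetching the cached history, which reduces the memory bandwidth requirements within the attention blocks. It requires no changes to the pre-training setup and no additional fine-tuning. Evaluations of Llama 2 and Pythia models on a range of downstream tasks show that it can reduce attention memory bandwidth requirements by up to eight times without loss of accuracy.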
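The abstract does not spell out how the cached history is selected, so the following is only a minimal sketch of the general idea of selective fetching, not the SparQ Attention algorithm itself. It assumes one possible selection rule: rank cached positions with a cheap approximate score that reads only `r` components of each key, then fetch full keys and values for the top `k` positions. The names `attend`, `selective_attend`, `r`, and `k` are hypothetical and chosen for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values):
    """Standard scaled dot-product attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, key)) for key in keys]
    weights = softmax(scores)
    out_dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(out_dim)]


def selective_attend(q, keys, values, r, k):
    """Attend over only k cached positions, chosen by a cheap approximate score.

    Assumption (not taken from the abstract): positions are ranked using only
    the r largest-magnitude query components, so ranking reads r numbers per
    cached key instead of the whole key.
    Returns (output, number_of_scalars_read_from_the_cache).
    """
    dims = sorted(range(len(q)), key=lambda i: -abs(q[i]))[:r]
    approx = [sum(q[i] * key[i] for i in dims) for key in keys]
    top = sorted(range(len(keys)), key=lambda s: -approx[s])[:k]
    top.sort()  # keep the original sequence order of the selected positions
    out = attend(q, [keys[s] for s in top], [values[s] for s in top])
    scalars_read = len(keys) * r + k * len(q) + k * len(values[0])
    return out, scalars_read
```

With `r` equal to the head dimension and `k` equal to the sequence length, the sketch reduces to dense attention; smaller `r` and `k` trade accuracy for fewer scalars read from the cache. How much accuracy is lost depends on the selection rule, which is exactly the part this sketch only guesses at.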

SparQ Attention: Bandwidth-Efficient LLM Inference

Multi-head attention layers, as used in the Transformer neural sequence
model, are a powerful alternative to RNNs for moving information across and
between sequences. While training these layers is generally fast and simple,
due to parallelizability across the length of the sequence, incremental
inference (where such parallelization is impossible) is often slow, due to the
memory-bandwidth cost of repeatedly loading the large "keys" and "values"
tensors. We propose a variant called multi-query attention, where the keys and
values are shared across all of the different attention "heads", greatly
reducing the size of these tensors and hence the memory bandwidth requirements
of incremental decoding. We verify experimentally that the resulting models can
indeed be much faster to decode, and incur only minor quality degradation from
the baseline.

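This paper proposes multi-query attention, a mechanism that reduces the memory bandwidth requirements of incremental decoding. Experiments confirm that it makes decoding faster while causing only minor quality degradation.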
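The size reduction claimed above can be made concrete with a rough back-of-the-envelope calculation. The sketch below compares the key/value cache of a multi-head layout (one set of keys and values per head) with a multi-query layout (one set shared by all heads). The model shape and the helper name `kv_cache_bytes` are assumptions chosen for illustration and do not come from the paper.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """Bytes held by cached keys and values.

    The leading factor 2 counts keys plus values; bytes_per_elem=2 assumes
    16-bit storage.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


# Hypothetical model shape, for illustration only.
n_layers, n_heads, head_dim = 32, 32, 128
seq_len, batch = 4096, 1

# Multi-head attention: every head keeps its own keys and values.
multi_head = kv_cache_bytes(n_layers, n_heads, head_dim, seq_len, batch)

# Multi-query attention: all heads share a single set of keys and values.
multi_query = kv_cache_bytes(n_layers, 1, head_dim, seq_len, batch)

print(multi_head, multi_query, multi_head // multi_query)
```

Under these assumptions the cache shrinks by a factor equal to the number of heads. Because each incremental decoding step has to load the cached keys and values, the bytes moved per step shrink by roughly the same factor.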