Generative large language models (LLMs) have opened up numerous novel
possibilities, but their significant computational requirements make
ubiquitous use challenging. Some of the most useful applications require
processing large numbers of samples at a time and using long contexts, both of
which significantly increase the memory communication load of the models. We
introduce SparQ Attention, a technique for increasing the inference throughput
of LLMs by reducing the memory bandwidth requirements within the attention
blocks through selective fetching of the cached history. Our proposed technique
can be applied directly to off-the-shelf LLMs during inference, without
requiring any modification to the pre-training setup or additional fine-tuning.
By evaluating Llama 2 and Pythia models on a wide range of downstream tasks,
we show that SparQ Attention can reduce attention memory bandwidth
requirements by up to a factor of eight without any loss in accuracy.

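SparQ Attention raises the inference throughput of large language models by selectively fetching the cached history, which reduces the memory bandwidth requirements within the attention blocks, and it requires neither changes to the pre-training setup nor additional fine-tuning. Evaluations of Llama 2 and Pythia models on multiple downstream tasks show that it can reduce attention memory bandwidth requirements by up to a factor of eight without loss of accuracy.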
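
To make the idea of selectively fetching the cached history concrete, the following is a minimal, generic sketch of single-query attention that reads only part of a key-value cache. It is an illustration under stated assumptions, not a specification of SparQ Attention: the function names, the ranking heuristic (scoring positions with only the `r` largest-magnitude query components), and the parameters `r` and `k` are hypothetical choices made for this example. The sketch counts how many cache elements are read so that the saving can be compared against dense attention.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def attend(q, keys, values):
    """Single-query scaled dot-product attention over the given cache rows."""
    scale = 1.0 / math.sqrt(len(q))
    logits = [scale * sum(a * b for a, b in zip(q, key)) for key in keys]
    weights = softmax(logits)
    width = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(width)]

def selective_attend(q, keys, values, r, k):
    """Hypothetical selective-fetch attention (illustrative assumption only).

    Step 1: rank cached positions using only r components of every key.
    Step 2: fetch the full key and value rows for the top-k positions.
    Returns the attention output and the number of cache elements read.
    """
    d = len(q)
    # Assumption: the r largest-magnitude query components give a usable ranking.
    dims = sorted(range(d), key=lambda j: -abs(q[j]))[:r]
    approx = [sum(q[j] * key[j] for j in dims) for key in keys]
    # Keep the k best-ranked positions, restored to their original order.
    top = sorted(sorted(range(len(keys)), key=lambda i: -approx[i])[:k])
    out = attend(q, [keys[i] for i in top], [values[i] for i in top])
    elements_read = len(keys) * r + 2 * k * d
    return out, elements_read

random.seed(0)
S, d = 256, 16  # cached sequence length and head dimension (toy sizes)
keys = [[random.gauss(0, 1) for _ in range(d)] for _ in range(S)]
values = [[random.gauss(0, 1) for _ in range(d)] for _ in range(S)]
q = [random.gauss(0, 1) for _ in range(d)]

dense_out = attend(q, keys, values)
dense_read = 2 * S * d  # dense attention reads every key and every value
sparse_out, sparse_read = selective_attend(q, keys, values, r=4, k=32)
print(f"cache elements read: dense={dense_read}, selective={sparse_read}")
```

With these toy sizes the selective variant reads a quarter of the cache elements that dense attention reads. How closely its output tracks the dense result depends on the data and on `r` and `k`; the sketch makes no accuracy claim.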