Attention Is All You Need But You Don't Need All Of It For Inference of Large Language Models

July 2024

Attention Is All You Need But You Don't Need All Of It For Inference of Large Language Models

Georgy Tyukin, Gbetondji J-S Dovonon, Jean Kaddour, Pasquale Minervini

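TL;DR: This paper studies how dropping MLP and attention layers from Llama-v2 models at inference time affects performance, addressing a gap in work on improving LLM inference efficiency. It finds that dropping the deeper attention layers lowers performance only slightly while significantly speeding up inference. Removing 33% of the attention layers in the 13B Llama2 model reduces average performance by only 1.8%.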
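
To make "dropping deeper attention layers" concrete, here is a minimal sketch of the idea, not the authors' implementation. It assumes a stack of residual blocks in which each block adds an attention update and then an MLP update to its input, and it skips the attention update for the last blocks in the stack. The names `deepest_layers`, `Block`, and `forward` are hypothetical, the sublayers are stand-in callables, normalization is omitted, and the rounding rule used to turn a fraction into a layer count is an assumption.

```python
from dataclasses import dataclass
from typing import AbstractSet, Callable, List, Set

Vector = List[float]


def deepest_layers(num_layers: int, fraction: float) -> Set[int]:
    """Return the 0-based indices of the last round(fraction * num_layers) layers.

    Assumption: the fraction is rounded to the nearest whole number of layers.
    """
    k = round(num_layers * fraction)
    return set(range(num_layers - k, num_layers))


@dataclass
class Block:
    attn: Callable[[Vector], Vector]  # stand-in for the self-attention sublayer
    mlp: Callable[[Vector], Vector]   # stand-in for the feed-forward sublayer


def forward(
    blocks: List[Block],
    x: Vector,
    skip_attn: AbstractSet[int] = frozenset(),
) -> Vector:
    """Run the residual stack, skipping attention in the blocks listed in skip_attn."""
    for i, block in enumerate(blocks):
        if i not in skip_attn:
            # Residual connection around attention; when skipped, x passes through unchanged.
            x = [a + b for a, b in zip(x, block.attn(x))]
        # The MLP sublayer is always kept in this sketch.
        x = [a + b for a, b in zip(x, block.mlp(x))]
    return x


# Toy usage: each "attention" adds 1.0 and each "MLP" adds 0.0.
toy = [Block(attn=lambda v: [1.0] * len(v), mlp=lambda v: [0.0] * len(v)) for _ in range(4)]
full = forward(toy, [0.0])
pruned = forward(toy, [0.0], skip_attn=deepest_layers(4, 0.5))
```

Because the residual connection carries the input through unchanged, a skipped attention sublayer needs no replacement: the block simply contributes its MLP update alone. Llama-2 13B has 40 decoder layers, so under the rounding assumption above `deepest_layers(40, 0.33)` selects the last 13 of them.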

Abstract

The inference demand for LLMs has skyrocketed in recent months, and serving models with low latencies remains challenging due to the quadratic input length complexity of the attention layers. In this work, we investigate the effect of dropping MLP and attention layers at inference time on the performance of Llama-v2 models. We find that dropping deeper atte