Following the success of dot-product attention in Transformers, numerous
approximations have recently been proposed to address its quadratic complexity
with respect to the input length. While these variants are memory- and
compute-efficient, they cannot be used directly with popular pre-trained
language models trained using vanilla attention without an expensive
corrective pre-training stage. In this work, we propose a simple yet highly
accurate approximation for vanilla attention. We process the queries in chunks,
and for each query, compute the top-$k$ scores with respect to the keys. Our
approach offers several advantages: (a) its memory usage is linear in the input
size, similar to linear attention variants such as Performer and RFA, (b) it is
a drop-in replacement for vanilla attention that does not require any
corrective pre-training, and (c) it can also lead to significant memory savings
in the feed-forward layers after casting them into the familiar query-key-value
framework. We evaluate the quality of top-$k$ approximation for multi-head
attention layers on the Long Range Arena Benchmark, and for feed-forward layers
of T5 and UnifiedQA on multiple QA datasets. We show that our approach yields
accuracy nearly identical to that of vanilla attention in multiple setups,
including training from scratch, fine-tuning, and zero-shot inference.

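This paper presents a simple and efficient approximation for vanilla attention, based on processing the queries in chunks; evaluation on multiple datasets shows that its accuracy is close to that of vanilla attention.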
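
To make the idea of chunked queries with top-$k$ scores concrete, the following is a minimal single-head NumPy sketch, not the authors' implementation. The function name `topk_attention`, the `chunk_size` parameter, and the scaling of scores by the square root of the feature dimension are assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np

def topk_attention(Q, K, V, k, chunk_size):
    """Illustrative top-k attention (hypothetical helper, single head).

    Q: (n_q, d) queries, K: (n_k, d) keys, V: (n_k, d_v) values.
    For each query, only its k highest-scoring keys take part in the softmax.
    """
    n_q, d = Q.shape
    out = np.zeros((n_q, V.shape[1]))
    # Process queries chunk by chunk so that only a (chunk_size, n_k) block
    # of scores exists at any time, instead of the full (n_q, n_k) matrix.
    for start in range(0, n_q, chunk_size):
        q = Q[start:start + chunk_size]                      # (c, d)
        scores = q @ K.T / np.sqrt(d)                        # (c, n_k); scaling is an assumption
        # Indices of the k largest scores in each row (in no particular order).
        idx = np.argpartition(scores, -k, axis=1)[:, -k:]    # (c, k)
        top = np.take_along_axis(scores, idx, axis=1)        # (c, k)
        # Numerically stable softmax restricted to the selected k scores.
        top = top - top.max(axis=1, keepdims=True)
        w = np.exp(top)
        w /= w.sum(axis=1, keepdims=True)
        # Weighted sum of the k selected value vectors per query.
        out[start:start + chunk_size] = np.einsum("ck,ckv->cv", w, V[idx])
    return out
```

In this sketch, setting `k` equal to the number of keys recovers ordinary softmax attention, while smaller `k` keeps only `chunk_size × k` scores per chunk after selection, which is where the memory saving in the sketch comes from.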