Following the success of dot-product attention in Transformers, numerous
approximations have been recently proposed to address its quadratic complexity
with respect to the input length. While these variants are memory and compute
efficient, it is not possible to directly use them with popular pre-trained
language models trained using vanilla attention, without an expensive
corrective pre-training stage. In this work, we propose a simple yet highly
accurate approximation for vanilla attention. We process the queries in chunks,
and for each query, compute the top-$k$ scores with respect to the keys. Our
approach offers several advantages: (a) its memory usage is linear in the input
size, similar to linear attention variants such as Performer and RFA, (b) it is
a drop-in replacement for vanilla attention that does not require any
corrective pre-training, and (c) it can also lead to significant memory savings
in the feed-forward layers after casting them into the familiar query-key-value
framework. We evaluate the quality of top-$k$ approximation for multi-head
attention layers on the Long Range Arena Benchmark, and for feed-forward layers
of T5 and UnifiedQA on multiple QA datasets. We show our approach leads to
accuracy that is nearly identical to vanilla attention in multiple setups,
including training from scratch, fine-tuning, and zero-shot inference.
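
As a rough illustration of the chunked top-$k$ idea described above, the sketch below processes queries a chunk at a time, keeps only the $k$ largest scores per query, and applies softmax over those. It is a minimal plain-Python sketch, not the authors' implementation: the function name `topk_attention`, the parameter `chunk_size`, and the $1/\sqrt{d}$ scaling are assumptions made for the example.

```python
import math


def topk_attention(queries, keys, values, k, chunk_size):
    """Approximate softmax attention where each query attends to its top-k keys.

    queries, keys, values are lists of equal-width float vectors (lists).
    chunk_size (hypothetical name) bounds how many queries are scored at once,
    so only a chunk_size x len(keys) block of scores exists at any time.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))  # assumed scaled dot-product attention
    outputs = []
    for start in range(0, len(queries), chunk_size):
        chunk = queries[start:start + chunk_size]
        # Score block for this chunk only: one row of scores per query.
        block = [
            [scale * sum(a * b for a, b in zip(q, key)) for key in keys]
            for q in chunk
        ]
        for scores in block:
            # Indices of the k highest-scoring keys for this query.
            top = sorted(range(len(keys)), key=lambda j: scores[j], reverse=True)[:k]
            # Numerically stable softmax restricted to the selected keys.
            peak = max(scores[j] for j in top)
            weights = [math.exp(scores[j] - peak) for j in top]
            total = sum(weights)
            out = [0.0] * len(values[0])
            for w, j in zip(weights, top):
                for d, v in enumerate(values[j]):
                    out[d] += (w / total) * v
            outputs.append(out)
    return outputs
```

With `k` equal to the number of keys this reduces to ordinary softmax attention, and the result does not depend on `chunk_size`; the chunking only limits how many score rows are held in memory at once.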

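This paper introduces a simple and efficient approximation of vanilla attention that processes the queries in chunks; evaluations on multiple datasets show that its accuracy is close to that of vanilla attention.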

Memory-efficient Transformers via Top-$k$ Attention

The attention-based Transformer architecture has enabled significant advances in
the field of natural language processing. In addition to new pre-training
techniques, recent improvements crucially rely on working with a relatively
large embedding dimension for tokens. Unfortunately, this leads to models that
are prohibitively large to be employed in downstream tasks. In this paper,
we identify one of the important factors contributing to the large embedding
size requirement. In particular, our analysis highlights that the scaling
between the number of heads and the size of each head in the current
architecture gives rise to a low-rank bottleneck in attention heads, causing
this limitation. We further validate this in our experiments. As a solution, we
propose to set the head size of an attention unit to the input sequence length,
independent of the number of heads, resulting in multi-head attention layers
with provably more expressive power. We empirically show that this allows us to
train models with a relatively smaller embedding dimension and with better
performance scaling.
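
The linear-algebra fact behind the bottleneck can be checked directly: for a sequence of length $n$ and head size $d_p$, the $n \times n$ pre-softmax score matrix $QK^\top$ has rank at most $d_p$. The sketch below is only an illustration under stated assumptions, not the paper's analysis: the helper names `matrix_rank` and `logits_rank` are hypothetical, and the projected queries and keys are deterministic Vandermonde-style rows chosen so that the rank is known exactly.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Exact rank via Gaussian elimination over the rationals."""
    m = [[Fraction(x) for x in row] for row in rows]
    rank = 0
    n_cols = len(m[0]) if m else 0
    for col in range(n_cols):
        # Find a row at or below the current rank with a nonzero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def logits_rank(seq_len, head_size):
    """Rank of the pre-softmax scores Q K^T for toy projected queries and keys."""
    # Assumed toy data: row i is [(i+1)^0, (i+1)^1, ..., (i+1)^(head_size-1)].
    q = [[(i + 1) ** j for j in range(head_size)] for i in range(seq_len)]
    k = q  # keys reuse the same toy rows to keep the example small
    logits = [[sum(a * b for a, b in zip(qi, kj)) for kj in k] for qi in q]
    return matrix_rank(logits)
```

For example, `logits_rank(6, 2)` returns 2, while `logits_rank(6, 6)` returns 6: only when the head size reaches the sequence length can the score matrix attain full rank.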

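This paper proposes an improvement to multi-head attention that sets the head size to the input sequence length, which makes the attention mechanism more expressive, allows models to be trained with a smaller embedding dimension, and improves model performance.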

Low-Rank Bottleneck in Multi-head Attention Models

Multi-head attention layers, as used in the Transformer neural sequence
model, are a powerful alternative to RNNs for moving information across and
between sequences. While training these layers is generally fast and simple,
due to parallelizability across the length of the sequence, incremental
inference (where such parallelization is impossible) is often slow, due to the
memory-bandwidth cost of repeatedly loading the large "keys" and "values"
tensors. We propose a variant called multi-query attention, where the keys and
values are shared across all of the different attention "heads", greatly
reducing the size of these tensors and hence the memory bandwidth requirements
of incremental decoding. We verify experimentally that the resulting models can
indeed be much faster to decode, and incur only minor quality degradation from
the baseline.
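
To make the sharing concrete, the sketch below performs one incremental decoding step in which every head has its own query but all heads read the same keys and values, and it counts how many cached numbers each scheme must keep. It is a minimal plain-Python sketch under stated assumptions: the names `multi_query_attention_step` and `kv_cache_entries` are hypothetical, and the $1/\sqrt{d}$ scaling is assumed from standard dot-product attention.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def multi_query_attention_step(head_queries, shared_keys, shared_values):
    """One decoding step: one query per head, a single K and V shared by all heads."""
    scale = 1.0 / math.sqrt(len(shared_keys[0]))  # assumed scaling
    dim = len(shared_values[0])
    outputs = []
    for q in head_queries:
        scores = [scale * sum(a * b for a, b in zip(q, key)) for key in shared_keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, shared_values)) for d in range(dim)]
        )
    return outputs


def kv_cache_entries(seq_len, num_heads, head_dim, multi_query):
    """Number of cached key and value scalars for one layer."""
    kv_heads = 1 if multi_query else num_heads  # shared K/V means one set, not one per head
    return 2 * seq_len * kv_heads * head_dim
```

With a sequence length of 1024, 8 heads, and head dimension 64, `kv_cache_entries` gives 1,048,576 cached values for per-head keys and values and 131,072 when they are shared, a reduction by a factor equal to the number of heads.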

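This paper proposes a multi-query attention mechanism that reduces the memory requirements of incremental decoding; experiments confirm that it makes decoding faster while causing only a minor loss in quality.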