Natural language processing (NLP) made an impressive jump with the
introduction of Transformers. ChatGPT is one of the most famous examples,
changing the perception of the possibilities of AI even outside the research
community. However, despite this impressive performance, the quadratic time and
space complexity of Transformers with respect to sequence length poses
significant limitations for handling long sequences. While efficient
Transformer architectures like Linformer and Performer with linear complexity
have emerged as promising solutions, their theoretical understanding remains
limited. In this paper, we introduce Sumformer, a novel and simple architecture
capable of universally approximating equivariant sequence-to-sequence
functions. We use Sumformer to give the first universal approximation results
for Linformer and Performer. Moreover, we derive a new proof for Transformers,
showing that just one attention layer is sufficient for universal
approximation.

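This paper introduces Sumformer, a new neural network architecture that can universally approximate equivariant sequence-to-sequence functions. The authors use Sumformer to obtain the first universal approximation results for Linformer and Performer, and give a new proof for Transformers showing that a single attention layer suffices for universal approximation.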

Sumformer: Universal Approximation for Efficient Transformers

Transformer models cannot easily scale to long sequences due to their O(N^2)
time and space complexity. This has led to Transformer variants seeking to
lower computational complexity, such as Longformer and Performer. While such
models have theoretically greater efficiency, their effectiveness on real NLP
tasks has not been well studied. We benchmark 7 variants of Transformer models
on 5 difficult NLP tasks and 7 datasets. We design experiments to isolate the
effect of pretraining and hyperparameter settings, to focus on their capacity
for long-range attention. Moreover, we present various methods to investigate
attention behaviors to illuminate model details beyond metric scores. We find
that the modified attention in long-range transformers has advantages in
content selection and query-guided decoding, but it comes with previously
unrecognized drawbacks such as insufficient attention to distant tokens and
accumulated approximation error.

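A comparative study of several Transformer variants finds that the modified attention in long-range models has advantages in content selection and query-guided decoding, but falls short in attending to distant tokens and suffers from accumulated approximation error.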

The NLP Task Effectiveness of Long-Range Transformers

Following the success of dot-product attention in Transformers, numerous
approximations have been recently proposed to address its quadratic complexity
with respect to the input length. While these variants are memory and compute
efficient, it is not possible to directly use them with popular pre-trained
language models trained using vanilla attention, without an expensive
corrective pre-training stage. In this work, we propose a simple yet highly
accurate approximation for vanilla attention. We process the queries in chunks,
and for each query, compute the top-$k$ scores with respect to the keys. Our
approach offers several advantages: (a) its memory usage is linear in the input
size, similar to linear attention variants such as Performer and RFA, (b) it is
a drop-in replacement for vanilla attention that does not require any
corrective pre-training, and (c) it can also lead to significant memory savings
in the feed-forward layers after casting them into the familiar query-key-value
framework. We evaluate the quality of top-$k$ approximation for multi-head
attention layers on the Long Range Arena Benchmark, and for feed-forward layers
of T5 and UnifiedQA on multiple QA datasets. We show our approach leads to
accuracy that is nearly identical to vanilla attention in multiple setups
including training from scratch, fine-tuning, and zero-shot inference.
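
The chunked top-$k$ idea described in this abstract can be illustrated with a
minimal NumPy sketch of the forward pass. This is not the authors'
implementation: the function names `topk_attention` and `vanilla_attention`,
the `chunk_size` parameter, and the single-head, unbatched tensor shapes are
all assumptions made for illustration. Each chunk of queries is scored against
all keys, only the k largest scores per query are kept, and the softmax is
taken over those k scores alone.

```python
import numpy as np


def vanilla_attention(Q, K, V):
    """Reference softmax attention. Q: (n_q, d), K: (n_k, d), V: (n_k, d_v)."""
    scores = Q @ K.T / np.sqrt(Q.shape[1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V


def topk_attention(Q, K, V, k, chunk_size):
    """Illustrative top-k attention (hypothetical helper, forward pass only).

    Queries are processed in chunks, so only a (chunk_size, n_k) score
    matrix exists at any one time; each query attends to its k best keys.
    """
    n_q, d = Q.shape
    k = min(k, K.shape[0])
    out = np.empty((n_q, V.shape[1]))
    for start in range(0, n_q, chunk_size):
        q = Q[start:start + chunk_size]                     # (c, d)
        scores = q @ K.T / np.sqrt(d)                       # (c, n_k)
        # Indices of the k largest scores per query (order within the k is arbitrary).
        idx = np.argpartition(scores, -k, axis=-1)[:, -k:]  # (c, k)
        top = np.take_along_axis(scores, idx, axis=-1)      # (c, k)
        # Softmax restricted to the selected keys.
        w = np.exp(top - top.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        # Weighted sum of the value rows that belong to the selected keys.
        out[start:start + chunk_size] = np.einsum("ck,ckv->cv", w, V[idx])
    return out


rng = np.random.default_rng(0)
Q = rng.normal(size=(10, 8))
K = rng.normal(size=(16, 8))
V = rng.normal(size=(16, 4))
approx = topk_attention(Q, K, V, k=4, chunk_size=3)
exact = vanilla_attention(Q, K, V)
```

When `k` equals the number of keys, the sketch reduces to vanilla attention,
which gives a simple sanity check; with smaller `k`, `approx` and `exact`
differ by the softmax mass that falls outside each query's top-k keys.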

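This paper presents a simple yet highly accurate approximation of vanilla attention that processes queries in chunks; evaluations on multiple datasets show accuracy close to that of vanilla attention.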