A promising approach to preserving model performance in linearized
transformers is to employ position-based re-weighting functions. However,
state-of-the-art re-weighting functions rely heavily on target sequence
lengths, making it difficult or impossible to apply them to autoregressive and
simultaneous tasks, where the target and sometimes even the input sequence
length are unknown. To address this issue, we propose Learned Proportions
(LeaP) and LeaPformers. Our contribution is built on two major components.
First, we generalize the dependence on explicit positional representations and
sequence lengths into dependence on sequence proportions for re-weighting.
Second, we replace static positional representations with dynamic proportions
derived via a compact module, enabling more flexible attention concentration
patterns. We evaluate LeaPformer against eight representative efficient
transformers on the Long-Range Arena benchmark, showing that LeaPformer
achieves the best quality-throughput trade-off. We also apply LeaPformer to
Wikitext-103 autoregressive language modeling and to simultaneous
speech-to-text translation for two language pairs, achieving competitive
results.
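
The idea of re-weighting by proportions rather than by position and sequence length can be illustrated with a minimal sketch. The abstract names neither the re-weighting function nor the form of the compact module, so everything below is an assumption. It uses a cosine re-weighting, cos(pi/2 * (p_i - p_j)), over a ReLU feature map, and `learned_proportions` is a hypothetical stand-in (a sigmoid over a linear projection) for the compact module. Because the cosine of a difference splits into cos*cos + sin*sin, the causal form needs only running sums and never the total sequence length.

```python
import math

HALF_PI = math.pi / 2.0


def relu(x):
    """Element-wise ReLU, used here as a non-negative feature map."""
    return [max(0.0, v) for v in x]


def learned_proportions(X, w, bias=0.0):
    # Hypothetical stand-in for the compact module: a sigmoid over a linear
    # projection maps each token vector to a proportion in (0, 1).
    return [
        1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + bias)))
        for x in X
    ]


def causal_reweighted_attention(Q, K, V, pq, pk, eps=1e-9):
    """Causal linear attention with weights
    relu(q_i) . relu(k_j) * cos(pi/2 * (pq_i - pk_j)) for j <= i."""
    d, dv = len(Q[0]), len(V[0])
    # Running key-value summaries and normalizers for the cos and sin parts.
    s_cos = [[0.0] * dv for _ in range(d)]
    s_sin = [[0.0] * dv for _ in range(d)]
    z_cos = [0.0] * d
    z_sin = [0.0] * d
    out = []
    for q, k, v, a, b in zip(Q, K, V, pq, pk):
        # Fold the current key/value into the running state first, so that
        # position i attends to itself and to everything before it.
        fk = relu(k)
        ck, sk = math.cos(HALF_PI * b), math.sin(HALF_PI * b)
        for i in range(d):
            z_cos[i] += fk[i] * ck
            z_sin[i] += fk[i] * sk
            for j in range(dv):
                s_cos[i][j] += fk[i] * ck * v[j]
                s_sin[i][j] += fk[i] * sk * v[j]
        # Read out with the query's own proportion.
        fq = relu(q)
        cq, sq = math.cos(HALF_PI * a), math.sin(HALF_PI * a)
        denom = sum(fq[i] * (cq * z_cos[i] + sq * z_sin[i]) for i in range(d)) + eps
        out.append([
            sum(fq[i] * (cq * s_cos[i][j] + sq * s_sin[i][j]) for i in range(d)) / denom
            for j in range(dv)
        ])
    return out


# Toy inputs; the projection weights are arbitrary illustrative values.
Q = [[1.0, 0.5], [0.2, 1.0], [0.7, 0.3]]
K = [[0.5, 1.0], [1.0, 0.1], [0.3, 0.8]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
pq = learned_proportions(Q, w=[0.4, -0.3])
pk = learned_proportions(K, w=[-0.2, 0.5])
Y = causal_reweighted_attention(Q, K, V, pq, pk)
```

The per-step cost depends only on the feature and value dimensions, which is what makes this form usable when the target length is unknown in advance.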

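We propose Learned Proportions (LeaP) and LeaPformers, which re-weight attention in linearized transformers using sequence proportions produced by a dynamic proportion-generation module rather than explicit positions and sequence lengths. This enables more flexible attention concentration patterns, yields the best quality-throughput trade-off on the Long-Range Arena benchmark, and gives competitive results on autoregressive and simultaneous tasks.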

LeaPformer: Enabling Linear Transformers for Autoregressive and Simultaneous Tasks via Learned Proportions

As transformer-based language models are trained on increasingly large
datasets and with vast numbers of parameters, finding more efficient
alternatives to the standard Transformer has become very valuable. While many
efficient Transformers and Transformer alternatives have been proposed, none
provide theoretical guarantees that they are a suitable replacement for the
standard Transformer. This makes it challenging to identify when to use a
specific model and what directions to prioritize for further investigation. In
this paper, we aim to understand the capabilities and limitations of efficient
Transformers, specifically the Sparse Transformer and the Linear Transformer.
We focus on their reasoning capability as exhibited by Chain-of-Thought (CoT)
prompts and, following previous work, model the underlying reasoning tasks as
Dynamic Programming (DP) problems. Our results show that while these models
are expressive enough to
solve general DP tasks, contrary to expectations, they require a model size
that scales with the problem size. Nonetheless, we identify a class of DP
problems for which these models can be more efficient than the standard
Transformer. We confirm our theoretical results through experiments on
representative DP tasks, adding to the understanding of efficient Transformers'
practical strengths and weaknesses.

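We study Transformer-based language models, focusing on the reasoning capability of the Sparse Transformer and the Linear Transformer, and find that they can be more efficient than the standard Transformer on one class of dynamic programming problems.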

Do Efficient Transformers Really Save Computation?

In pursuit of faster computation, Efficient Transformers demonstrate an
impressive variety of approaches -- models attaining sub-quadratic attention
complexity can utilize a notion of sparsity or a low-rank approximation of
inputs to reduce the number of attended keys; other ways to reduce complexity
include locality-sensitive hashing, key pooling, additional memory that stores
information in compacted form, and hybridization with other architectures,
such as CNNs. Often resting on a strong mathematical foundation, kernelized
approaches approximate attention with linear complexity while retaining high
accuracy. In the present paper, we therefore aim to extend the idea of
trainable kernel methods to approximate the self-attention mechanism of the
Transformer architecture.
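
A minimal sketch of how a kernelized approach reaches linear complexity, under stated assumptions: the abstract does not give the kernel's architecture, so `feature_map` below is a hypothetical one-layer feedforward map with a softplus activation (chosen only because it keeps features strictly positive), and `W` and `b` stand in for trainable parameters. Attention is computed as phi(Q) (phi(K)^T V), so the n-by-n score matrix is never formed.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x)); always strictly positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def feature_map(x, W, b):
    # Hypothetical trainable kernel feature map: one linear layer + softplus.
    return [
        softplus(sum(wij * xj for wij, xj in zip(row, x)) + bi)
        for row, bi in zip(W, b)
    ]


def kernel_linear_attention(Q, K, V, W, b):
    """Non-causal linear attention: out_i = phi(q_i) S / (phi(q_i) . z),
    where S = sum_j phi(k_j) v_j^T and z = sum_j phi(k_j)."""
    fQ = [feature_map(q, W, b) for q in Q]
    fK = [feature_map(k, W, b) for k in K]
    m, dv = len(W), len(V[0])
    # Key-value summary and normalizer, each built in one pass over the keys.
    S = [[sum(fk[a] * v[c] for fk, v in zip(fK, V)) for c in range(dv)]
         for a in range(m)]
    z = [sum(fk[a] for fk in fK) for a in range(m)]
    out = []
    for fq in fQ:
        denom = sum(fq[a] * z[a] for a in range(m))
        out.append([sum(fq[a] * S[a][c] for a in range(m)) / denom
                    for c in range(dv)])
    return out


# Toy inputs; W and b are arbitrary illustrative values, not trained weights.
Q = [[1.0, -0.5], [0.2, 1.0], [-0.7, 0.3]]
K = [[0.5, 1.0], [-1.0, 0.1], [0.3, -0.8]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
W = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.6]]
b = [0.0, 0.1, -0.1]
Y = kernel_linear_attention(Q, K, V, W, b)
```

In a trained model `W` and `b` would be learned jointly with the rest of the network; here they only show where the trainable part sits.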

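This paper aims to extend the idea of trainable kernel methods to approximating the self-attention mechanism of the Transformer architecture, targeting faster computation while retaining high accuracy.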

Linear Self-Attention Approximation via Trainable Feedforward Kernel

The impressive progress in NLP techniques has been driven by the development
of multi-task benchmarks such as GLUE and SuperGLUE. While these benchmarks
focus on tasks for one or two input sentences, there has been exciting work in
designing efficient techniques for processing much longer inputs. In this
paper, we present MuLD: a new long document benchmark consisting of only
documents over 10,000 tokens. By modifying existing NLP tasks, we create a
diverse benchmark which requires models to successfully model long-term
dependencies in the text. We evaluate how existing models perform, and find
that our benchmark is much more challenging than the 'short document'
equivalents of its tasks. Furthermore, by evaluating both regular and efficient
transformers, we show that models with increased context length are better able
to solve the tasks presented, suggesting that future improvements in these
models are vital for solving similar long document problems. We release the
data and code for baselines to encourage further research on efficient NLP
models.

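MuLD is a new long-document benchmark consisting only of documents over 10,000 tokens, designed to test how NLP models perform on long-document tasks. The results show that Transformer models with increased context length solve the benchmark's tasks better, which points to directions for further research.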