Tensor Attention, a multi-view attention that is able to capture high-order
correlations among multiple modalities, can overcome the representational
limitations of classical matrix attention. However, its $\Omega(n^3)$ time
complexity, where $n$ is the input sequence length, poses a significant
obstacle to its practical use in transformers. In this work, we prove that,
under a bounded-entries assumption, the backward gradient of tensor attention
training can be computed in almost linear $n^{1+o(1)}$ time, the same
complexity as its forward computation. We provide a closed-form solution for
the gradient and propose a fast computation method based on polynomial
approximation and tensor algebraic tricks. Furthermore, we
prove the necessity and tightness of our assumption through hardness analysis,
showing that slightly weakening it renders the gradient problem unsolvable in
truly subcubic time. Our theoretical results establish the feasibility of
efficient higher-order transformer training and may facilitate practical
applications of tensor attention architectures.

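We prove that the backward gradient of tensor attention training can be computed in almost linear $n^{1+o(1)}$ time, provide a closed-form solution for the gradient, and propose a fast computation method based on polynomial approximation and tensor algebraic tricks. Our theoretical results establish the feasibility of higher-order Transformer training and may facilitate practical applications of tensor attention architectures.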

Tensor Attention Training: Provably Efficient Learning of Higher-order Transformers
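
To make the cubic cost mentioned in this abstract concrete, here is a minimal sketch of naive tensor attention in plain Python. The abstract does not define the operator, so the form used here is an assumption: each query scores every pair of keys from two key sets, a softmax runs over all n^2 pairs, and the value for a pair is the elementwise product of two value rows. The function name and the unscaled scores are illustrative choices, not the authors' implementation, and this is the slow baseline rather than the almost-linear method the paper proves.

```python
import math

def naive_tensor_attention(Q, K1, K2, V1, V2):
    """Naive tensor attention sketch (assumed form, not the paper's fast method).

    Q, K1, K2 are n x d lists of lists; V1, V2 are n x d_v lists of lists.
    For each of the n queries we visit all n * n key pairs, so the cost
    grows as n^3 (times the feature dimensions).
    """
    n, d = len(Q), len(Q[0])
    d_v = len(V1[0])
    out = []
    for i in range(n):
        # Trilinear score for query i against every key pair (j, k).
        scores = [
            [sum(Q[i][c] * K1[j][c] * K2[k][c] for c in range(d)) for k in range(n)]
            for j in range(n)
        ]
        # Softmax over all n^2 pairs, shifted by the max for numerical stability.
        m = max(max(row) for row in scores)
        weights = [[math.exp(s - m) for s in row] for row in scores]
        z = sum(sum(row) for row in weights)
        # Weighted sum of elementwise products of the paired value rows.
        row_out = [0.0] * d_v
        for j in range(n):
            for k in range(n):
                w = weights[j][k] / z
                for c in range(d_v):
                    row_out[c] += w * V1[j][c] * V2[k][c]
        out.append(row_out)
    return out
```

With all-zero queries every pair gets the same weight, so each output row reduces to the average of the pairwise value products, which is a convenient sanity check.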

Transformer training is notoriously difficult, requiring a careful design of
optimizers and use of various heuristics. We make progress towards
understanding the subtleties of training transformers by carefully studying a
simple yet canonical linearized shallow transformer model. Specifically, we
train linear transformers to solve regression tasks, inspired by J. von Oswald
et al. (ICML 2023), and K. Ahn et al. (NeurIPS 2023). Most importantly, we
observe that our proposed linearized models can reproduce several prominent
aspects of transformer training dynamics. Consequently, the results obtained in
this paper suggest that a simple linearized transformer model could actually be
a valuable, realistic abstraction for understanding transformer optimization.

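By studying a linearized shallow transformer model, we gain a deeper understanding of the subtleties of transformer training and observe that the linearized model reproduces several prominent aspects of transformer training dynamics. The results therefore suggest that a simple linearized transformer model can be a valuable, realistic abstraction for understanding transformer optimization.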

Linear attention is (maybe) all you need (to understand transformer optimization)

We evaluate three simple, normalization-centric changes to improve
Transformer training. First, we show that pre-norm residual connections
(PreNorm) and smaller initializations enable warmup-free, validation-based
training with large learning rates. Second, we propose $\ell_2$ normalization
with a single scale parameter (ScaleNorm) for faster training and better
performance. Finally, we reaffirm the effectiveness of normalizing word
embeddings to a fixed length (FixNorm). On five low-resource translation pairs
from TED Talks-based corpora, these changes always converge, giving an average
+1.1 BLEU over state-of-the-art bilingual baselines and a new 32.8 BLEU on
IWSLT'15 English-Vietnamese. We observe sharper performance curves, more
consistent gradient norms, and a linear relationship between activation scaling
and decoder depth. Surprisingly, in the high-resource setting (WMT'14
English-German), ScaleNorm and FixNorm remain competitive but PreNorm degrades
performance.

Applying PreNorm, ScaleNorm, and FixNorm speeds up training and makes it more stable, yielding an average gain of 1.1 BLEU on five low-resource translation pairs and 32.8 BLEU on IWSLT'15.
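
The three changes named in this abstract can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the eps guard, and the default fixed length of 1.0 are hypothetical, and the sublayer is any function from a vector to a vector of the same size.

```python
import math

def scale_norm(x, g, eps=1e-5):
    """ScaleNorm sketch: l2-normalize x, then multiply by a single scalar g."""
    norm = math.sqrt(sum(v * v for v in x))
    # eps (an assumed guard) avoids division by zero for an all-zero vector.
    return [g * v / max(norm, eps) for v in x]

def fix_norm(embedding, length=1.0, eps=1e-5):
    """FixNorm sketch: rescale a word embedding to a fixed l2 length."""
    return scale_norm(embedding, length, eps)

def pre_norm_residual(x, sublayer, g):
    """PreNorm sketch: normalize the input first, then add the residual."""
    normed = scale_norm(x, g)
    return [a + b for a, b in zip(x, sublayer(normed))]
```

For example, `scale_norm([3.0, 4.0], 2.0)` rescales a vector of norm 5 to norm 2, and `pre_norm_residual` leaves the residual path itself unnormalized, which is the property the abstract associates with warmup-free training.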