Training stability is of great importance to Transformers. In this work, we investigate the training dynamics of Transformers by examining the evolution of the attention layers. In particular, we track the attention entropy for each attention head during the course of training, which is a proxy for model sharpness. We identify a common pattern across different architectures and tasks, where low attention entropy is accompanied by high training instability, which can take the form of oscillating loss or divergence. We denote the pathologically low attention entropy, corresponding to highly concentrated attention scores, as $\textit{entropy collapse}$. As a remedy, we propose $\sigma$Reparam, a simple and efficient solution where we reparameterize all linear layers with spectral normalization and an additional learned scalar. We demonstrate that the proposed reparameterization successfully prevents entropy collapse in the attention layers, promoting more stable training. Additionally, we prove a tight lower bound on the attention entropy, which decreases exponentially fast with the spectral norm of the attention logits, providing additional motivation for our approach. We conduct experiments with $\sigma$Reparam on image classification, image self-supervised learning, machine translation, automatic speech recognition, and language modeling tasks, across Transformer architectures. We show that $\sigma$Reparam provides stability and robustness with respect to the choice of hyperparameters, going so far as to enable training of (a) a Vision Transformer to competitive performance without warmup, weight decay, layer normalization, or adaptive optimizers; (b) deep architectures in machine translation and (c) speech recognition to competitive performance without warmup and adaptive optimizers.
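
To make the tracked quantity concrete, the following is a minimal NumPy sketch of per-head attention entropy, assuming the usual definition: for each query row $A_i$ of the softmax attention matrix, $\mathrm{Ent}(A_i) = -\sum_j A_{ij} \log A_{ij}$, averaged over queries. The function names, the `(heads, queries, keys)` tensor layout, and the logit scales are illustrative assumptions, not taken from the authors' code.

```python
import numpy as np

def softmax(logits, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def attention_entropy(logits):
    """Mean entropy of the attention rows, one value per head.

    logits: pre-softmax scores with assumed shape (heads, queries, keys).
    Returns an array of shape (heads,).
    """
    probs = softmax(logits, axis=-1)
    # Clip inside the log so that zero probabilities contribute 0, not NaN.
    row_entropy = -(probs * np.log(np.clip(probs, 1e-12, None))).sum(axis=-1)
    return row_entropy.mean(axis=-1)

rng = np.random.default_rng(0)
base = rng.normal(size=(2, 4, 8))          # 2 heads, 4 queries, 8 keys
diffuse = attention_entropy(0.1 * base)    # small logits: near-uniform attention
peaked = attention_entropy(50.0 * base)    # large logits: highly concentrated attention
```

With 8 keys the entropy is at most $\log 8$, reached by uniform attention. Scaling the same logits up concentrates each row on a few keys and drives the entropy toward zero, which is the regime the abstract calls entropy collapse.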

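This paper examines the training dynamics of Transformers by tracking the attention entropy of each attention head. The authors identify a phenomenon they call entropy collapse, in which low attention entropy is accompanied by high training instability. They propose $\sigma$Reparam, a simple and efficient remedy that prevents it, and further prove a lower bound on the attention entropy. They evaluate $\sigma$Reparam on image classification, self-supervised learning, machine translation, automatic speech recognition, and language modeling tasks, where it provides more stable and robust training across Transformer architectures, in some settings even without warmup, weight decay, layer normalization, or adaptive optimizers.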
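
As a rough illustration of the reparameterization, the sketch below rescales a weight matrix as $\hat{W} = \frac{\gamma}{\sigma(W)} W$, where $\sigma(W)$ is the spectral norm (largest singular value) and $\gamma$ is the learned scalar. It covers only the forward computation in NumPy. The power-iteration estimator, the iteration count, and the function names are assumptions made for this example, and a real training setup would additionally learn $\gamma$ and $W$ by gradient descent.

```python
import numpy as np

def spectral_norm(weight, num_iters=100, seed=0):
    """Estimate the largest singular value of `weight` by power iteration."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=weight.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        # Alternate between left and right singular vector estimates.
        u = weight @ v
        u /= np.linalg.norm(u)
        v = weight.T @ u
        v /= np.linalg.norm(v)
    return float(u @ weight @ v)

def sigma_reparam(weight, gamma):
    """Return gamma * W / sigma(W); the result has spectral norm close to |gamma|."""
    return (gamma / spectral_norm(weight)) * weight

rng = np.random.default_rng(1)
W = rng.normal(size=(16, 32))          # hypothetical linear-layer weight
W_hat = sigma_reparam(W, gamma=1.0)    # effective weight used in the forward pass
```

Because the effective weight's spectral norm is set by the single scalar $\gamma$ rather than by the raw entries of $W$, the magnitude of the attention logits can no longer grow unchecked through the weights alone. This ties in with the abstract's bound relating attention entropy to the spectral norm of the attention logits.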

Stabilizing Transformer Training by Preventing Attention Entropy Collapse