Transformers have achieved remarkable success in several domains, ranging from natural language processing to computer vision. Nevertheless, it has been recently shown that stacking self-attention layers - the distinctive architectural component of Transformers - can result in rank collapse of the tokens' representations at initialization. The question of if and how rank collapse affects training is still largely unanswered, and its investigation is necessary for a more comprehensive understanding of this architecture. In this work, we shed new light on the causes and the effects of this phenomenon. First, we show that rank collapse of the tokens' representations hinders training by causing the gradients of the queries and keys to vanish at initialization. Furthermore, we provide a thorough description of the origin of rank collapse and discuss how to prevent it via an appropriate depth-dependent scaling of the residual branches. Finally, our analysis unveils that specific architectural hyperparameters affect the gradients of queries and values differently, leading to disproportionate gradient norms. This suggests an explanation for the widespread use of adaptive methods for Transformers' optimization.

探究在Transformer的自我注意层中可能发生的排名坍塌现象及其影响，发现其会导致查询和键的梯度消失，导致训练受阻，但可以通过适当的深度相关的残差分支缩放来预防，而特定的架构超参数会导致查询和值的梯度的不均衡，这解释了为什么在Transformers的优化中广泛使用自适应方法。

Transformer中的信号传播：理论视角和秩崩溃的作用