Despite their huge success, transformer models remain difficult to scale in depth. In this work, we develop a unified signal propagation theory and provide formulae that govern the moments of the forward and backward signal through the transformer model. Our framework can be used to understand and mitigate vanishing/exploding gradients, rank collapse, and instability associated with high attention scores. We also propose DeepScaleLM, an initialization and scaling scheme that conserves unit output/gradient moments throughout the model, enabling the training of very deep models with hundreds of layers. We find that transformer models could be much deeper: our deep models with fewer parameters outperform shallow models in Language Modeling, Speech Translation, and Image Classification, across Encoder-only, Decoder-only, and Encoder-Decoder variants, for both Pre-LN and Post-LN transformers, and for multiple datasets and model sizes. These improvements also translate into improved performance on downstream Question Answering tasks and improved robustness for image classification.

Transformers Get Stable: An End-to-End Signal Propagation Theory for Language Models