Previous work has demonstrated that MLPs within ReLU Transformers exhibit high levels of sparsity, with many of their activations equal to zero for any given token. We build on that work to explore in more depth how token-level sparsity evolves over the course of training, and how it connects to broader sparsity patterns over the course of a sequence or batch, demonstrating that the different layers within small transformers exhibit distinctly layer-specific patterns on both of these fronts. In particular, we demonstrate that the first and last layers of the network have distinctive and in many ways inverted relationships to sparsity, and we explore the implications for the structure of the feature representations being learned at different depths of the model. We additionally explore the phenomenon of ReLU dimensions "turning off", and show evidence suggesting that "neuron death" is driven primarily by the dynamics of training, rather than simply occurring randomly or accidentally as a result of outliers.
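
The two notions of sparsity contrasted above can be made concrete with a minimal sketch. It assumes a toy batch of random pre-activations rather than a real transformer; the helper names `token_sparsity` and `batch_dead_fraction` are hypothetical and do not come from the paper. Token-level sparsity is taken here as the fraction of hidden dimensions that are exactly zero for one token, and a dimension is counted as "turned off" for the batch when it is zero for every token in it.

```python
import random


def relu(x):
    """Standard ReLU: negative pre-activations become exactly zero."""
    return x if x > 0.0 else 0.0


def token_sparsity(acts):
    """Per-token sparsity: fraction of zero activations in each row (token)."""
    return [sum(1 for a in row if a == 0.0) / len(row) for row in acts]


def batch_dead_fraction(acts):
    """Batch-level sparsity: fraction of hidden dimensions that are zero
    for every token in the batch (a proxy for "turned off" dimensions)."""
    n_hidden = len(acts[0])
    dead = sum(
        1 for j in range(n_hidden) if all(row[j] == 0.0 for row in acts)
    )
    return dead / n_hidden


# Toy data (an assumption for illustration, not the paper's setup):
# 8 tokens, 16 hidden dimensions, Gaussian pre-activations.
rng = random.Random(0)
n_tokens, n_hidden = 8, 16
pre = [[rng.gauss(0.0, 1.0) for _ in range(n_hidden)] for _ in range(n_tokens)]

# Force two dimensions negative for every token to mimic dead neurons.
for row in pre:
    row[0] = -1.0
    row[1] = -2.0

acts = [[relu(v) for v in row] for row in pre]
per_token = token_sparsity(acts)        # one sparsity value per token
dead_frac = batch_dead_fraction(acts)   # share of always-zero dimensions
```

In this sketch a layer can be highly sparse per token while having few dimensions that are off across the whole batch, which is the kind of distinction the abstract draws between token-level and batch-level patterns.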

Uncovering Layer-Dependent Activation Sparsity Patterns in ReLU Transformers