Recently, there has been increasing interest in efficient pretraining
paradigms for training Transformer-based models. Several recent approaches use
smaller models to initialize larger models in order to save computation (e.g.,
stacking and fusion). In this work, we study the fundamental question of how to
select the best growing strategy from a given pool of growing strategies. Prior
works have extensively focused on loss- and/or function-preserving behavior at
initialization or simply performance at the end of training. Instead, we
identify that behavior at initialization can be misleading as a predictor of
final performance and present an alternative perspective based on early
training dynamics, which we call "landscape-aware growing (LAG)". We perform
extensive analysis of correlation of the final performance with performance in
the initial steps of training and find early and more accurate predictions of
the optimal growing strategy (i.e., with only a small "lag" after
initialization). This perspective also motivates an adaptive strategy for
gradual stacking.
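
As a minimal sketch of the selection idea, assuming loss curves have already been recorded for each candidate growing strategy: the hypothetical helper `pick_strategy` ranks candidates by their loss a few steps after growing (`lag`) instead of at initialization (`lag=0`). The strategy names and loss values are made-up placeholders for illustration, not results from the paper.

```python
from typing import Dict, List


def pick_strategy(curves: Dict[str, List[float]], lag: int) -> str:
    """Return the strategy with the lowest loss `lag` steps after growing.

    curves maps a strategy name to its loss at steps 0, 1, 2, ...
    lag=0 reproduces selection by loss at initialization.
    """
    if not curves:
        raise ValueError("need at least one candidate strategy")
    for name, losses in curves.items():
        if lag < 0 or lag >= len(losses):
            raise ValueError(f"curve for {name!r} has no step {lag}")
    return min(curves, key=lambda name: curves[name][lag])


# Placeholder curves (illustrative numbers only, not measured data).
curves = {
    "loss_preserving_init": [2.0, 1.95, 1.92, 1.90],
    "non_preserving_init": [3.0, 2.2, 1.8, 1.6],
}

choice_at_init = pick_strategy(curves, lag=0)
choice_after_lag = pick_strategy(curves, lag=2)
```

In this toy setup the two choices differ, which is the kind of disagreement between initialization and early training that the abstract describes.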

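This paper studies how to select a growing strategy for efficiently pretraining Transformer-based models, using early training dynamics rather than behavior at initialization as the predictor of final performance; the same perspective motivates an adaptive strategy for gradual stacking.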

Landscape-Aware Growing: The Power of a Little LAG

Excursions in gradient magnitude pose a persistent challenge when training
deep networks. In this paper, we study the early training phases of deep
normalized ReLU networks, accounting for the induced scale invariance by
examining effective learning rates (LRs). Starting with the well-known fact
that batch normalization (BN) leads to exponentially exploding gradients at
initialization, we develop an ODE-based model to describe early training
dynamics. Our model predicts that in the gradient flow, effective LRs will
eventually equalize, aligning with empirical findings on warm-up training.
Using large LRs is analogous to applying an explicit solver to a stiff
non-linear ODE, causing overshooting and vanishing gradients in lower layers
after the first step. Achieving overall balance demands careful tuning of LRs,
depth, and (optionally) momentum. Our model predicts the formation of spreads
in effective LRs, consistent with empirical measurements. Moreover, we observe
that large spreads in effective LRs result in training issues concerning
accuracy, indicating the importance of controlling these dynamics. To further
support a causal relationship, we implement a simple scheduling scheme
prescribing uniform effective LRs across layers and confirm accuracy benefits.
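
For weights that feed into a normalization layer, the loss does not change when the weights are rescaled, and a common definition of the effective LR under plain SGD is the LR divided by the squared weight norm. Assuming that definition (the abstract does not state the paper's exact one), the sketch below computes per-layer effective LRs, a simple spread measure, and the per-layer LRs that would make effective LRs uniform. The function names and the norm values are hypothetical.

```python
from typing import List


def effective_lrs(lr: float, weight_norms: List[float]) -> List[float]:
    """Effective LR per layer, assuming eta_eff = lr / ||w||^2 for SGD."""
    return [lr / (n * n) for n in weight_norms]


def spread(values: List[float]) -> float:
    """Ratio of the largest to the smallest value; 1.0 means uniform."""
    return max(values) / min(values)


def uniform_layer_lrs(target_eff_lr: float, weight_norms: List[float]) -> List[float]:
    """Per-layer LRs that give every layer the same effective LR."""
    return [target_eff_lr * n * n for n in weight_norms]


norms = [0.5, 1.0, 2.0]  # illustrative per-layer weight norms
eff = effective_lrs(0.1, norms)
per_layer = uniform_layer_lrs(0.1, norms)
```

With a single global LR, layers with small weight norms take much larger effective steps than layers with large norms; rescaling the LR per layer by the squared norm removes that spread under the stated assumption.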

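This paper studies the early training phases of deep normalized ReLU networks, accounting for the induced scale invariance by examining effective learning rates (LRs). It finds that using large LRs is analogous to applying an explicit solver to a stiff non-linear ODE, which causes overshooting and vanishing gradients in lower layers after the first step; LRs, depth, and (optionally) momentum therefore need careful tuning to maintain overall balance.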

Spreads in Effective Learning Rates: The Perils of Batch Normalization During Early Training

Deep Neural Networks (DNNs) are prone to learn shortcut patterns that damage
the generalization of the DNN during deployment. Shortcut Learning is
concerning, particularly when the DNNs are applied to safety-critical domains.
This paper aims to better understand shortcut learning through the lens of the
learning dynamics of the internal neurons during the training process. More
specifically, we make the following observations: (1) While previous works
treat shortcuts as synonymous with spurious correlations, we emphasize that not
all spurious correlations are shortcuts. We show that shortcuts are only those
spurious features that are "easier" than the core features. (2) We build upon
this premise and use instance difficulty methods (like Prediction Depth) to
quantify "easy" and to identify this behavior during the training phase. (3) We
empirically show that shortcut learning can be detected by observing the
learning dynamics of the DNN's early layers, irrespective of the network
architecture used. In other words, easy features learned by the initial layers
of a DNN early during the training are potential shortcuts. We verify our
claims on simulated and real medical imaging data and justify the empirical
success of our hypothesis by showing the theoretical connections between
Prediction Depth and information-theoretic concepts like V-usable information.
Lastly, our experiments show the insufficiency of monitoring only accuracy
plots during training (as is common in machine learning pipelines), and we
highlight the need for monitoring early training dynamics using example
difficulty metrics.
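
To make the Prediction Depth idea concrete, here is a minimal sketch assuming the commonly used definition: a probe (for example k-NN) is fitted on each layer's representation, and an example's prediction depth is the earliest layer from which every probe agrees with the model's final prediction. The abstract does not specify the paper's probe setup, so the function name and inputs are hypothetical.

```python
from typing import Sequence


def prediction_depth(probe_preds: Sequence[int], final_pred: int) -> int:
    """Earliest layer index from which every probe agrees with final_pred.

    probe_preds[i] is the label predicted by a probe on the layer-i
    representation. Returns len(probe_preds) if even the last probe
    disagrees with the final prediction.
    """
    depth = len(probe_preds)
    # Walk backwards from the last layer while the probes still agree.
    for i in range(len(probe_preds) - 1, -1, -1):
        if probe_preds[i] != final_pred:
            break
        depth = i
    return depth


# An example resolved by the second probe onwards (layer index 1).
easy_example_depth = prediction_depth([0, 1, 1, 1], final_pred=1)
# An example whose probes only settle at the last layer (index 3).
hard_example_depth = prediction_depth([1, 0, 0, 1], final_pred=1)
```

Under this reading, examples with a low prediction depth are the "easy" ones settled by early layers, which the abstract identifies as potential shortcuts worth inspecting.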

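By observing the learning dynamics of the internal neurons of Deep Neural Networks (DNNs), this paper hypothesizes that easy-to-learn features lead to shortcut learning and verifies the hypothesis experimentally. It argues for monitoring early training dynamics rather than only model accuracy.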