Gradient regularization (GR), which aims to penalize the gradient norm atop
the loss function, has shown promising results in training modern
over-parameterized deep neural networks. However, can we trust this powerful
technique? This paper reveals that GR can cause performance degeneration in
adaptive optimization scenarios, particularly with learning rate warmup. Our
empirical and theoretical analyses suggest this is due to GR inducing
instability and divergence in gradient statistics of adaptive optimizers at the
initial training stage. Inspired by the warmup heuristic, we propose three GR
warmup strategies, each relaxing the regularization effect to a certain extent
during the warmup course to ensure the accurate and stable accumulation of
gradients. In experiments on the Vision Transformer family, we confirm that the
three GR warmup strategies effectively circumvent these issues, thereby largely
improving model performance. We also note that scalable models tend
to rely more on GR warmup, with performance improving by up to
3% on CIFAR-10 compared to baseline GR. Code is available at
https://github.com/zhaoyang-0204/gnp.
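
To make the idea of penalizing the gradient norm concrete, the following is a minimal sketch on a toy diagonal quadratic loss. It approximates the gradient of L(theta) + lam * ||grad L(theta)|| with a finite difference along the normalized gradient direction. The abstract does not describe the three warmup strategies, so `gr_coefficient` is only a hypothetical linear ramp of the regularization strength, not the authors' method; all function names and the step size `r` are assumptions made for illustration.

```python
import math


def grad_quadratic(theta, curvatures):
    """Gradient of the toy loss L(theta) = 0.5 * sum(a_i * theta_i**2)."""
    return [a * t for a, t in zip(curvatures, theta)]


def gr_gradient(theta, curvatures, lam, r=1e-3):
    """Approximate gradient of L(theta) + lam * ||grad L(theta)||.

    The penalty term uses a finite difference along the unit gradient:
    grad ||g|| ~= (grad L(theta + r * g / ||g||) - g) / r.
    """
    g = grad_quadratic(theta, curvatures)
    norm = math.sqrt(sum(x * x for x in g))
    if norm == 0.0 or lam == 0.0:
        # No penalty contribution: fall back to the plain gradient.
        return g
    shifted = [t + r * x / norm for t, x in zip(theta, g)]
    g_shifted = grad_quadratic(shifted, curvatures)
    return [x + lam * (xs - x) / r for x, xs in zip(g, g_shifted)]


def gr_coefficient(step, warmup_steps, lam_max):
    """Hypothetical linear ramp of the GR strength over the warmup period."""
    if warmup_steps <= 0:
        return lam_max
    return lam_max * min(1.0, step / warmup_steps)
```

In a training loop one would call `gr_gradient(theta, curvatures, gr_coefficient(step, warmup_steps, lam_max))` and pass the result to the optimizer, so that the penalty is weak while the optimizer's gradient statistics are still being accumulated.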

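This paper shows that gradient regularization (GR) can degrade performance in adaptive optimization settings and proposes three GR warmup strategies to address the problem; experiments confirm that the three strategies substantially improve model performance.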

When Will Gradient Regularization Be Harmful?

Although gradient descent with momentum is widely used in modern deep
learning, a concrete understanding of its effects on the training trajectory
remains elusive. In this work, we empirically show that momentum gradient
descent with a large learning rate and learning rate warmup displays large
catapults, driving the iterates towards flatter minima than those found by
gradient descent. We then provide empirical evidence and theoretical intuition
that the large catapult is caused by momentum "amplifying" the
self-stabilization effect (Damian et al., 2023).
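
The setup studied here, heavy-ball momentum combined with a linearly warmed-up learning rate, can be sketched as follows. This is an illustrative implementation under assumed names (`warmup_lr`, `momentum_gd`) and an assumed linear schedule; it shows the update rule only and does not attempt to reproduce the catapult behavior reported in the paper.

```python
def warmup_lr(step, warmup_steps, peak_lr):
    """Linear warmup (assumed schedule): ramp up to peak_lr, then hold."""
    if warmup_steps <= 0:
        return peak_lr
    return peak_lr * min(1.0, (step + 1) / warmup_steps)


def momentum_gd(grad_fn, theta0, peak_lr, beta=0.9, warmup_steps=10, num_steps=100):
    """Heavy-ball momentum gradient descent with learning rate warmup.

    grad_fn maps a list of parameters to a list of gradients.
    Returns the full list of iterates, starting with theta0.
    """
    theta = list(theta0)
    velocity = [0.0] * len(theta)
    history = [list(theta)]
    for step in range(num_steps):
        lr = warmup_lr(step, warmup_steps, peak_lr)
        g = grad_fn(theta)
        # Accumulate the gradient into the momentum buffer.
        velocity = [beta * v + x for v, x in zip(velocity, g)]
        # Step along the momentum buffer with the current learning rate.
        theta = [t - lr * v for t, v in zip(theta, velocity)]
        history.append(list(theta))
    return history
```

For example, `momentum_gd(lambda th: [th[0]], [1.0], peak_lr=0.1)` runs the method on the one-dimensional loss 0.5 * theta**2; recording a sharpness estimate alongside `history` would be the natural next step for studying the trajectory.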

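Through empirical study, we find that momentum gradient descent with a large learning rate and learning rate warmup produces large catapults that drive the iterates toward flatter minima; we provide empirical evidence and theoretical intuition that these catapults arise because momentum "amplifies" the self-stabilization effect.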

Large Catapults in Momentum Gradient Descent with Warmup: An Empirical Study

The learning rate warmup heuristic achieves remarkable success in stabilizing
training, accelerating convergence and improving generalization for adaptive
stochastic optimization algorithms like RMSprop and Adam. Here, we study its
mechanism in detail. Pursuing the theory behind warmup, we identify a problem
of the adaptive learning rate (i.e., it has problematically large variance in
the early stage), suggest warmup works as a variance reduction technique, and
provide both empirical and theoretical evidence to verify our hypothesis. We
further propose RAdam, a new variant of Adam, by introducing a term to rectify
the variance of the adaptive learning rate. Extensive experimental results on
image classification, language modeling, and neural machine translation verify
our intuition and demonstrate the effectiveness and robustness of our proposed
method. All implementations are available at:
this https URL
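
As an illustration of the rectification idea, the sketch below computes the variance rectification term as it is commonly stated for RAdam. The abstract itself does not give the formula, so the expressions for `rho_inf`, `rho_t`, and the threshold of 4 are taken from the usual presentation of the algorithm and should be checked against the paper; the function name is an assumption.

```python
import math


def radam_rectification(step, beta2=0.999):
    """Rectification term r_t for the adaptive learning rate at a 1-based step.

    Returns None while the variance of the adaptive learning rate is treated
    as intractable (rho_t <= 4); in that regime the adaptive scaling is
    skipped and a plain momentum update is used instead.
    """
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    beta2_t = beta2 ** step
    rho_t = rho_inf - 2.0 * step * beta2_t / (1.0 - beta2_t)
    if rho_t <= 4.0:
        return None
    return math.sqrt(
        (rho_t - 4.0) * (rho_t - 2.0) * rho_inf
        / ((rho_inf - 4.0) * (rho_inf - 2.0) * rho_t)
    )
```

The term is small in the first steps and approaches 1 as training proceeds, so multiplying the Adam step by it behaves like a built-in warmup whose shape depends only on `beta2`.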

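This paper examines why learning rate warmup succeeds at stabilizing training, accelerating convergence, and improving generalization. It finds that the adaptive learning rate has problematically large variance in the early stage, suggests that warmup acts as a variance reduction technique, and proposes RAdam, a new variant of Adam that rectifies the variance of the adaptive learning rate; experimental results demonstrate its effectiveness and robustness.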