The early phase of training of deep neural networks is critical for their
final performance. In this work, we study how the hyperparameters of stochastic
gradient descent (SGD) used in the early phase of training affect the rest of
the optimization trajectory. We argue for the existence of the "break-even"
point on this trajectory, beyond which the curvature of the loss surface and
noise in the gradient are implicitly regularized by SGD. In particular, we
demonstrate on multiple classification tasks that using a large learning rate
in the initial phase of training reduces the variance of the gradient, and
improves the conditioning of the covariance of gradients. These effects are
beneficial from the optimization perspective and become visible after the
break-even point. Complementing prior work, we also show that using a low
learning rate results in bad conditioning of the loss surface even for a neural
network with batch normalization layers. In short, our work shows that key
properties of the loss surface are strongly influenced by SGD in the early
phase of training. We argue that studying the impact of the identified effects
on generalization is a promising future direction.

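In summary: this paper examines the hyperparameters of stochastic gradient descent in the early phase of neural network training. It shows that using a large learning rate early on reduces the variance of the gradient and improves the conditioning of the gradient covariance matrix. Beyond the "break-even" point, optimization with SGD implicitly regularizes the curvature of the loss surface and the noise in the gradient, which benefits optimization. Studying how these effects influence generalization is a promising research direction.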
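
To make the quantities named above concrete, the following is a minimal sketch of how the variance of the gradient and the conditioning of the gradient covariance might be measured from per-example gradients. It is not the authors' measurement procedure. The toy squared-loss linear model, the function names `per_example_grads` and `gradient_covariance_stats`, and the definition of the condition number as the ratio of the largest to the smallest non-negligible eigenvalue are all assumptions made for illustration.

```python
import numpy as np

def per_example_grads(w, X, y):
    """Per-example gradients of 0.5 * (x @ w - y)**2 for a toy linear model (assumed here)."""
    residuals = X @ w - y              # shape (n,)
    return residuals[:, None] * X      # shape (n, d): one gradient per row

def gradient_covariance_stats(grads):
    """Return (trace, largest eigenvalue, condition number) of the gradient covariance.

    The trace is the total variance of the per-example gradients. The condition
    number is taken as largest / smallest non-negligible eigenvalue, which is an
    illustrative choice. Assumes the covariance is not identically zero.
    """
    centered = grads - grads.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / grads.shape[0]
    eigvals = np.linalg.eigvalsh(cov)  # eigenvalues in ascending order
    largest = eigvals[-1]
    smallest = eigvals[eigvals > 1e-12 * largest][0]
    return float(np.trace(cov)), float(largest), float(largest / smallest)

# Usage on synthetic data (values are random, so no output is claimed here).
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=256)
w = np.zeros(5)                        # parameters at some point on the trajectory
total_var, lam_max, cond = gradient_covariance_stats(per_example_grads(w, X, y))
print(total_var, lam_max, cond)
```

Under these assumptions, tracking `total_var` and `cond` along training runs with different learning rates would be one way to compare trajectories. For a deep network the covariance matrix is far too large to form explicitly, so its leading eigenvalues would have to be approximated instead.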