Sep, 2020
Implicit Gradient Regularization
HTML | PDF
David G. T. Barrett, Benoit Dherin
TL;DR
This paper studies how gradient descent behaves when optimizing neural networks and finds that the discrete steps of gradient descent implicitly regularize models by penalizing trajectories with large loss gradients. This "implicit gradient regularization" biases gradient descent toward flat minima, making solutions robust to noisy parameter perturbations, which helps explain why these models avoid overfitting.
Abstract
Gradient descent can be surprisingly good at optimizing deep neural networks without overfitting and without explicit regularization. We find that the discrete steps of gradient descent implicitly regularize models by penalizing gradient descent trajectories that have large loss gradients.
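
To make the mechanism in the summary concrete, here is a minimal JAX sketch (not the paper's code) that adds the gradient-norm penalty explicitly as a surrogate for the implicit regularization described above. The linear model, the toy data, and the `lr / 4` penalty coefficient are illustrative assumptions; the coefficient follows the backward-error-analysis scaling the paper associates with the learning rate, but treat the exact form here as an assumption.

```python
import jax
import jax.numpy as jnp

def loss_fn(params, x, y):
    # Plain least-squares loss of a linear model (an illustrative
    # stand-in for a neural-network loss).
    pred = x @ params
    return jnp.mean((pred - y) ** 2)

def modified_loss(params, x, y, lr):
    # Explicit version of the penalty attributed to discrete gradient
    # descent steps: loss + (lr / 4) * ||grad loss||^2.
    # The lr / 4 coefficient is an assumption based on the paper's
    # backward-error-analysis result.
    g = jax.grad(loss_fn)(params, x, y)
    return loss_fn(params, x, y) + 0.25 * lr * jnp.sum(g ** 2)

@jax.jit
def step(params, x, y, lr):
    # One gradient-descent step on the explicitly regularized loss,
    # which penalizes moving through regions with large loss gradients.
    return params - lr * jax.grad(modified_loss)(params, x, y, lr)

# Usage: a few steps on random data.
key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (32, 4))
y = jnp.sum(x, axis=1)
params = jnp.zeros(4)
for _ in range(100):
    params = step(params, x, y, 1e-1)
print(loss_fn(params, x, y))
```

Because the penalty term shrinks the gradient norm along the optimization path, minima reached this way tend to sit in flatter regions of the loss surface, matching the robustness-to-perturbation claim in the summary.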