Adaptive gradient methods, which use historical gradient information to adjust the learning rate automatically, have been observed to generalize worse than stochastic gradient descent (SGD) with momentum when training deep neural networks. How to close this generalization gap remains an open problem. In this work, we show that adaptive gradient methods such as Adam and Amsgrad are sometimes "over-adapted". We design a new algorithm, called Partially adaptive momentum estimation method (Padam), which unifies Adam/Amsgrad with SGD to achieve the best of both worlds. Experiments on standard benchmarks show that Padam maintains a convergence rate as fast as that of Adam/Amsgrad while generalizing as well as SGD in training deep neural networks. These results suggest that practitioners can once again pick up adaptive gradient methods for faster training of deep neural networks.

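This work designs a new algorithm, called the partially adaptive momentum estimation method, which introduces a partially adaptive parameter $p$ to unify Adam/Amsgrad with SGD and achieve the best of both worlds, and proves the convergence rate of the proposed algorithm in the stochastic nonconvex optimization setting. Experiments show that the algorithm maintains a convergence rate as fast as that of Adam/Amsgrad while generalizing as well as SGD in training deep neural networks. These results suggest that practitioners can return to adaptive gradient methods to speed up the training of deep neural networks.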
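The role of the partially adaptive parameter $p$ can be illustrated with a minimal sketch. The text above does not spell out the update rule, so the sketch assumes the form $\theta \leftarrow \theta - \alpha\, m / \hat{v}^{\,p}$ with Amsgrad-style moment estimates. The names `padam_step` and `init_state`, the hyperparameter defaults, and the small `eps` stabilizer are illustrative assumptions rather than the authors' implementation.

```python
def init_state(n):
    """Zero-initialised first moment, second moment, and running max of the second moment."""
    return [0.0] * n, [0.0] * n, [0.0] * n


def padam_step(theta, grad, state, lr=0.1, beta1=0.9, beta2=0.999, p=0.125, eps=1e-8):
    """One partially adaptive update on a list of floats.

    Assumed form: theta <- theta - lr * m / v_hat**p.
    p = 0.5 gives an Amsgrad-like step; p = 0 removes the adaptive scaling,
    leaving a momentum-SGD-like step. Defaults are illustrative, not tuned values.
    """
    m, v, v_hat = state
    new_theta, new_m, new_v, new_v_hat = [], [], [], []
    for x, g, m_i, v_i, vh_i in zip(theta, grad, m, v, v_hat):
        m_i = beta1 * m_i + (1.0 - beta1) * g        # moving average of gradients
        v_i = beta2 * v_i + (1.0 - beta2) * g * g    # moving average of squared gradients
        vh_i = max(vh_i, v_i)                        # Amsgrad-style non-decreasing maximum
        # eps is an assumed stabilizer that avoids dividing by zero when p > 0
        new_theta.append(x - lr * m_i / ((vh_i + eps) ** p))
        new_m.append(m_i)
        new_v.append(v_i)
        new_v_hat.append(vh_i)
    return new_theta, (new_m, new_v, new_v_hat)


if __name__ == "__main__":
    # Toy problem: minimise f(x) = x^2, whose gradient is 2x.
    theta, state = [1.0], init_state(1)
    for _ in range(200):
        grad = [2.0 * x for x in theta]
        theta, state = padam_step(theta, grad, state, lr=0.05)
    print(theta)
```

Under this assumed form, varying `p` between 0 and 0.5 interpolates between the two families that the algorithm is said to unify, which is one way to read the claim that fully adaptive methods can be "over-adapted".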

Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks