This paper studies a class of adaptive gradient based momentum algorithms that update the search directions and learning rates simultaneously using past gradients. This class, which we refer to as the "Adam-type", includes the popular algorithms such as the Adam, AMSGrad and AdaGrad. Despite their popularity in training deep neural networks, the convergence of these algorithms for solving nonconvex problems remains an open question. This paper provides a set of mild sufficient conditions that guarantee the convergence for the Adam-type methods. We prove that under our derived conditions, these methods can achieve the convergence rate of order $O(\log{T}/\sqrt{T})$ for nonconvex stochastic optimization. We show the conditions are essential in the sense that violating them may make the algorithm diverge. Moreover, we propose and analyze a class of (deterministic) incremental adaptive gradient algorithms, which has the same $O(\log{T}/\sqrt{T})$ convergence rate. Our study could also be extended to a broader class of adaptive gradient methods in machine learning and optimization.

本文研究一类自适应梯度基于动量的算法，这些算法同时使用过去的梯度更新搜索方向和学习率。该类算法被称为Adam类型，研究了一些充分条件，保证了这些方法在解决非凸优化问题时的收敛性，为训练深度神经网络提供了理论支持。另外，文中提出了一类（确定性）增量自适应梯度算法，收敛速度与Adam类型算法相同，可以应用于更广泛的机器学习和优化问题。

关于一类Adam型非凸优化算法的收敛性