TL;DR: This post gives a series of theoretical analyses of the convergence of stochastic gradient descent (SGD) and its variants, the most widely used algorithms in machine learning applications, on non-convex optimization problems. It proves that, under weak assumptions, the Delayed AdaGrad with momentum algorithm converges with high probability to a stationary point.
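To make the algorithm concrete, here is a minimal sketch of one common formulation of Delayed AdaGrad with momentum: the global (gradient-norm) variant with heavy-ball momentum, where the step size at step t is computed from the gradients up to step t-1 only (the "delay"). The function name, the parameters `alpha`, `beta`, `mu`, and the toy quadratic objective are illustrative choices, not taken from the paper.

```python
import numpy as np

def delayed_adagrad_momentum(grad_fn, x0, steps=100, alpha=0.1, beta=2.0, mu=0.9):
    """Sketch of Delayed AdaGrad with momentum (global step-size variant).

    The step size eta_t depends only on gradients g_1..g_{t-1}, so it is
    independent of the noise in the current gradient -- the "delayed" part.
    """
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)   # momentum buffer
    acc = 0.0              # accumulated squared gradient norms (delayed)
    for _ in range(steps):
        g = grad_fn(x)
        eta = alpha / np.sqrt(beta + acc)  # uses past gradients only
        m = mu * m + eta * g               # heavy-ball style momentum
        x = x - m
        acc += float(np.dot(g, g))         # updated AFTER eta is computed
    return x

# Usage on a toy quadratic f(x) = ||x||^2, whose gradient is 2x.
x_star = delayed_adagrad_momentum(lambda x: 2 * x, np.array([1.0, -1.0]), steps=200)
```

Note the ordering inside the loop: `eta` is computed before `acc` is updated with the current gradient, which is exactly what distinguishes the delayed step size from vanilla AdaGrad.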
Abstract
Stochastic gradient descent (SGD) and its variants are the most used algorithms in machine learning applications. In particular, SGD with adaptive learning rates and …