TL;DR: This post gives a series of theoretical analyses of the convergence of stochastic gradient descent (SGD) and its variants, the most widely used algorithms in machine learning applications, on non-convex optimization problems. It proves that, under weak assumptions, the Delayed AdaGrad with momentum algorithm converges with high probability to a stationary point.
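To make the algorithm concrete, here is a minimal sketch of one common formulation of Delayed AdaGrad with momentum: the global (gradient-norm) variant with heavy-ball momentum, where the step size at step t is computed from the gradients up to step t-1 only (the "delay"). The function name, the parameters `alpha`, `beta`, `mu`, and the toy quadratic objective are illustrative choices, not taken from the paper.

```python
import numpy as np

def delayed_adagrad_momentum(grad_fn, x0, steps=100, alpha=0.1, beta=2.0, mu=0.9):
    """Sketch of Delayed AdaGrad with momentum (global step-size variant).

    The step size eta_t depends only on gradients g_1..g_{t-1}, so it is
    independent of the noise in the current gradient -- the "delayed" part.
    """
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)   # momentum buffer
    acc = 0.0              # accumulated squared gradient norms (delayed)
    for _ in range(steps):
        g = grad_fn(x)
        eta = alpha / np.sqrt(beta + acc)  # uses past gradients only
        m = mu * m + eta * g               # heavy-ball style momentum
        x = x - m
        acc += float(np.dot(g, g))         # updated AFTER eta is computed
    return x

# Usage on a toy quadratic f(x) = ||x||^2, whose gradient is 2x.
x_star = delayed_adagrad_momentum(lambda x: 2 * x, np.array([1.0, -1.0]), steps=200)
```

Note the ordering inside the loop: `eta` is computed before `acc` is updated with the current gradient, which is exactly what distinguishes the delayed step size from vanilla AdaGrad.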
Abstract
Stochastic gradient descent (SGD) and its variants are the most used algorithms in machine learning applications. In particular, SGD with adaptive learning rates and …