May 2018
On the Convergence of Stochastic Gradient Descent with Adaptive Stepsizes
Xiaoyu Li, Francesco Orabona
TL;DR
By studying generalized AdaGrad stepsizes in both the convex and non-convex settings, this paper establishes sufficient conditions under which these stepsizes guarantee asymptotic convergence of the gradients to zero, filling a gap in the theory of these methods. Moreover, it shows that these stepsizes automatically adapt to the noise level of the stochastic gradients in both the convex and non-convex cases, interpolating between O(1/T) and O(1/√T) rates, up to logarithmic terms.
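To make the stepsize scheme concrete, below is a minimal Python sketch of SGD with a global generalized-AdaGrad stepsize, assuming the time-delayed form η_t = α / (β + Σ_{i<t} ‖g_i‖²)^(1/2+ε) described in the paper; the function and hyperparameter names here are illustrative, not the authors' code.

```python
import numpy as np

def sgd_adagrad_global(grad_fn, x0, alpha=1.0, beta=1.0, eps=0.0, T=1000):
    """SGD with a global generalized-AdaGrad stepsize (illustrative sketch).

    eta_t = alpha / (beta + sum_{i<t} ||g_i||^2) ** (0.5 + eps)

    grad_fn(x) should return a stochastic gradient at x.
    """
    x = np.asarray(x0, dtype=float)
    grad_norm_sq_sum = 0.0  # running sum of squared gradient norms
    for _ in range(T):
        g = grad_fn(x)
        # The stepsize uses only *past* gradients, so it is independent
        # of the current stochastic gradient g (the delayed variant).
        eta = alpha / (beta + grad_norm_sq_sum) ** (0.5 + eps)
        x = x - eta * g
        grad_norm_sq_sum += float(np.dot(g, g))
    return x

# Usage on a noisy quadratic f(x) = 0.5 * ||x||^2 (hypothetical test problem):
rng = np.random.default_rng(0)
noisy_grad = lambda x: x + 0.1 * rng.standard_normal(x.shape)
x_final = sgd_adagrad_global(noisy_grad, x0=np.ones(5), T=5000)
```

Note that, unlike coordinate-wise AdaGrad, this variant scales the whole gradient by a single scalar stepsize, which is the setting the convergence analysis addresses.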
Abstract
Stochastic gradient descent is the method of choice for large scale optimization of machine learning objective functions. Yet, its performance is greatly variable and heavily depends on the choice of the stepsize …