The adaptive stochastic gradient descent (SGD) with momentum has been widely
adopted in deep learning as well as convex optimization. In practice, the last
iterate is commonly used as the final solution to make decisions. However, the
available regret analysis and the setting of constant momentum parameters only
guarantee the optimal convergence of the averaged solution. In this paper, we
fill this theory-practice gap by investigating the convergence of the last
iterate (referred to as individual convergence), which is a more difficult task
than convergence analysis of the averaged solution. Specifically, in the
constrained convex cases, we prove that the adaptive Polyak's Heavy-ball (HB)
method, in which only the step size is updated using the exponential moving
average strategy, attains an optimal individual convergence rate of
$O(\frac{1}{\sqrt{t}})$, as opposed to the optimality of $O(\frac{\log t}{\sqrt
{t}})$ of SGD, where $t$ is the number of iterations. Our new analysis not only
shows how the HB momentum and its time-varying weight help us to achieve the
acceleration in convex optimization but also gives valuable hints how the
momentum parameters should be scheduled in deep learning. Empirical results on
optimizing convex functions and training deep networks validate the correctness
of our convergence analysis and demonstrate the improved performance of the
adaptive HB methods.

本文旨在解决现实应用中使用随机梯度下降法进行深度学习和凸优化时，普遍使用最后一次迭代作为最终解决方案，但唯独它的可用遗憾分析和恒定动量参数设置只保证平均解的最佳收敛问题，并且探究单独收敛分析问题，最终我们证明了：在约束凸问题中，使用 Polyak's Heavy-ball 方法，它只能通过移动平均策略更新步长，即可获得 O（1 / 根号 T）的最优收敛率，而不是普通 SGD 的 O（log T / 根号 T）的优化。同时，我们的新型分析方法不仅阐释了 HB 动量及其时间变化的作用，还给出了有价值的暗示，即动量参数应如何进行安排。同时，针对优化凸函数和训练深度网络的实证结果，验证了我们收敛分析的正确性，并证明了自适应 HB 方法的改进性能。