Stochastic gradient descent (SGD) with averaging is a simple and popular method for solving the stochastic optimization problems that arise in machine learning. For strongly convex problems, its convergence rate was known to be O(\log(T)/T). However, recent results showed that a different algorithm can attain the optimal O(1/T) rate. This might lead one to believe that SGD is suboptimal, and perhaps should even be replaced as a method of choice. In this paper, we investigate the convergence rate of SGD with averaging in a stochastic setting. We show that for smooth problems, the algorithm attains the optimal O(1/T) rate. However, for non-smooth problems, the convergence rate with averaging might really be \Omega(\log(T)/T), and this is not just an artifact of the analysis. On the flip side, we show that a simple modification of the averaging step suffices to recover the O(1/T) rate, and no significant change of the algorithm is necessary.
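To make the two averaging schemes concrete, the following is a minimal sketch and not the paper's own code. It assumes a toy one-dimensional objective F(w) = (lam/2) w^2 + |w| on [-1, 1], which is strongly convex and non-smooth at its minimizer w* = 0, a step size of 1/(lam t), and Gaussian gradient noise. It also assumes that the "simple modification" is to average only the last fraction alpha of the iterates; the abstract does not spell out the modification, so this choice and all function names are illustrative.

```python
import math
import random


def sgd_iterates(T, lam=1.0, noise=1.0, seed=0):
    """Projected SGD on the toy objective F(w) = (lam/2) * w**2 + |w| over [-1, 1].

    Returns the list of iterates w_1, ..., w_T (assumed setup, for illustration).
    """
    rng = random.Random(seed)
    w = 1.0
    iterates = []
    for t in range(1, T + 1):
        iterates.append(w)
        # A subgradient of F at w; 0 is a valid choice for |w| at w == 0.
        subgrad = lam * w + (1.0 if w > 0 else -1.0 if w < 0 else 0.0)
        # Noisy subgradient estimate with zero-mean Gaussian noise.
        g = subgrad + rng.gauss(0.0, noise)
        # Step size 1 / (lam * t), then projection back onto [-1, 1].
        w = w - g / (lam * t)
        w = max(-1.0, min(1.0, w))
    return iterates


def full_average(iterates):
    """Standard averaging: the mean of all iterates."""
    return sum(iterates) / len(iterates)


def suffix_average(iterates, alpha=0.5):
    """Modified averaging (assumed form): the mean of the last alpha fraction."""
    k = max(1, math.ceil(alpha * len(iterates)))
    tail = iterates[-k:]
    return sum(tail) / len(tail)


if __name__ == "__main__":
    ws = sgd_iterates(T=10_000)
    # Both values estimate the minimizer w* = 0.
    print("full average:  ", full_average(ws))
    print("suffix average:", suffix_average(ws, alpha=0.5))
```

Only the averaging step differs between the two estimates: the SGD loop itself is unchanged, which is the sense in which no significant change of the algorithm is needed.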

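This paper studies the optimality of stochastic gradient descent in the stochastic setting. We show that for smooth problems the algorithm attains the optimal O(1/T) convergence rate, but that for non-smooth problems the convergence rate with averaging might really be Ω(log(T)/T), and this is not just an artifact of the analysis. On the flip side, we show that a simple modification of the averaging step suffices to recover the O(1/T) convergence rate, with no other change to the algorithm. In addition, we present experimental results that support our findings and point out open problems.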

Optimal Gradient Descent for Strongly Convex Stochastic Optimization