Stochastic Gradient Descent (SGD) is one of the simplest and most popular stochastic optimization methods. Although it has been studied theoretically for decades, the classical analysis usually requires non-trivial smoothness assumptions, which do not apply to many modern applications of SGD with non-smooth objective functions, such as support vector machines. In this paper, we investigate the performance of SGD \emph{without} such smoothness assumptions, as well as a running average scheme that converts the SGD iterates into a solution with optimal optimization accuracy. In this framework, we prove that after $T$ rounds, the suboptimality of the \emph{last} SGD iterate scales as $O(\log(T)/\sqrt{T})$ for non-smooth convex objective functions, and as $O(\log(T)/T)$ in the non-smooth strongly convex case. To the best of our knowledge, these are the first bounds of this kind, and they almost match the minimax-optimal rates obtainable by appropriate averaging schemes. We also propose a new and simple averaging scheme, which not only attains optimal rates but can also be easily computed on-the-fly (in contrast, the suffix averaging scheme proposed in \citet{RakhShaSri12arxiv} is not as simple to implement). Finally, we provide some experimental illustrations.
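
To make the idea of an averaging scheme that is "computed on-the-fly" concrete, the following is a minimal Python sketch, not the authors' implementation. It runs SGD on a toy non-smooth, strongly convex objective and maintains a weighted running average using only the previous average and the current iterate. The weighting rule $a_t = (\eta+1)/(t+\eta)$, the parameter name `eta`, the toy objective $|w-1| + \tfrac{1}{2}(w-1)^2$, the Gaussian noise model, and the $1/t$ step size are all assumptions chosen for illustration; the abstract does not specify any of them.

```python
import random


def sgd_with_running_average(subgradient, w0, num_rounds, step_size, eta=3.0):
    """Run SGD and maintain a weighted running average of the iterates.

    Assumed averaging rule (illustrative, not taken from the abstract):
        avg_t = (1 - a_t) * avg_{t-1} + a_t * w_t,  a_t = (eta + 1) / (t + eta)
    With eta = 0 this is the plain uniform average of w_1, ..., w_t.
    Larger eta puts more weight on later iterates.
    """
    w = list(w0)
    avg = list(w0)  # overwritten at t = 1, because a_1 = 1
    for t in range(1, num_rounds + 1):
        g = subgradient(w)  # stochastic subgradient at the current point
        w = [wi - step_size(t) * gi for wi, gi in zip(w, g)]
        a = (eta + 1.0) / (t + eta)
        # Only the previous average and the new iterate are needed,
        # so no history of iterates has to be stored.
        avg = [(1.0 - a) * ai + a * wi for ai, wi in zip(avg, w)]
    return w, avg


def noisy_subgradient(w, rng):
    """Noisy subgradient of the toy objective |w - 1| + 0.5 * (w - 1)**2.

    The objective is non-smooth at its minimizer w = 1 and 1-strongly convex.
    """
    d = w[0] - 1.0
    sign = (d > 0) - (d < 0)  # 0 is a valid subgradient of |.| at the kink
    return [sign + d + rng.gauss(0.0, 1.0)]


if __name__ == "__main__":
    rng = random.Random(0)
    last, averaged = sgd_with_running_average(
        lambda w: noisy_subgradient(w, rng),
        w0=[0.0],
        num_rounds=5000,
        step_size=lambda t: 1.0 / t,  # 1 / (lambda * t) with lambda = 1
    )
    print("last iterate:", last[0], "averaged iterate:", averaged[0])
```

Because the update touches only `avg` and the newest `w`, memory use is independent of the number of rounds. An average over a suffix of the iterates, by contrast, needs to know where the suffix starts, which in general depends on the total number of rounds.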

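This paper studies the performance of SGD without smoothness assumptions, together with a running average scheme that converts the SGD iterates into a solution with optimal optimization accuracy. It proves that after $T$ rounds, the suboptimality of the last SGD iterate scales as $O(\log(T)/\sqrt{T})$ for non-smooth convex objective functions, and as $O(\log(T)/T)$ in the non-smooth strongly convex case. In addition, the paper proposes a new and simple averaging scheme and provides some experimental illustrations.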

Stochastic Gradient Descent for Non-smooth Optimization: Convergence Results and Optimal Averaging Schemes