Stochastic Gradient Descent (SGD) is a central tool in machine learning. We prove that SGD converges to zero loss, even with a fixed learning rate, in the special case of linear classifiers with smooth monotone loss functions, optimized on linearly separable data. Previous proofs of exact asymptotic convergence of SGD required either a learning rate that vanishes asymptotically or averaging of the SGD iterates. Furthermore, if the loss function has an exponential tail (e.g., logistic regression), then we prove that with SGD the weight vector converges in direction to the $L_2$ max margin vector as $O(1/\log(t))$ for almost all separable datasets, and the loss converges as $O(1/t)$, similarly to gradient descent. These results suggest an explanation for the similar behavior observed in deep networks when trained with SGD.

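This paper studies the convergence of SGD in machine learning. For linearly separable data and smooth monotone loss functions, it proves that SGD converges to zero loss under a fixed, nonzero learning rate. For classification losses with an exponential tail (e.g., the logistic loss), the weight vector converges in direction to the $L_2$ max margin vector, and the loss converges at a rate of $O(1/t)$.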
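
The stated behavior can be observed numerically on a toy problem. The following is a minimal sketch, not the paper's experimental setup: the four-point dataset, the step size of 0.1, the epoch count, the per-epoch reshuffling, and the function names `logistic_loss` and `sgd_fixed_lr` are all assumptions chosen for illustration. For this symmetric dataset the $L_2$ max margin direction (for a classifier without a bias term) is $(1, 1)/\sqrt{2}$, so the script compares the direction of the SGD iterate against it.

```python
import numpy as np


def logistic_loss(w, X, y):
    """Mean logistic loss log(1 + exp(-y * <w, x>)), computed stably."""
    margins = y * (X @ w)
    return float(np.mean(np.logaddexp(0.0, -margins)))


def sgd_fixed_lr(X, y, lr=0.1, epochs=2000, seed=0):
    """Single-sample SGD with a constant step size (illustrative helper).

    Assumption: each epoch visits the samples once in a fresh random order.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    losses = []
    for _ in range(epochs):
        for i in rng.permutation(n):
            margin = y[i] * (X[i] @ w)
            # Gradient of log(1 + exp(-margin)) with respect to w.
            grad = -y[i] * X[i] / (1.0 + np.exp(margin))
            w -= lr * grad  # the step size never decays
        losses.append(logistic_loss(w, X, y))
    return w, losses


# Hypothetical linearly separable dataset (no bias term).
X = np.array([[2.0, 1.0], [1.0, 2.0], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w, losses = sgd_fixed_lr(X, y)

# Max margin direction for this particular dataset, derived by symmetry.
w_hat = np.array([1.0, 1.0]) / np.sqrt(2.0)
cosine = float(w @ w_hat / np.linalg.norm(w))

print("loss after first epoch:", losses[0])
print("loss after last epoch: ", losses[-1])
print("cosine to max margin direction:", cosine)
```

Plotting `losses` on a log-log scale is one way to compare the observed decay against the $O(1/t)$ rate. The norm of `w` keeps growing throughout training, which is why convergence is stated for the direction of the weight vector and not for the vector itself.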

Stochastic Gradient Descent on Separable Data: Exact Convergence with a Fixed Learning Rate