The Momentum Stochastic Gradient Descent (MSGD) algorithm has been widely applied to many nonconvex optimization problems in machine learning. Popular examples include training deep neural networks and dimensionality reduction. Due to the lack of convexity and the extra momentum term, the optimization theory of MSGD is still largely unknown. In this paper, we study this fundamental optimization algorithm based on the so-called "strict saddle problem." Using a diffusion-approximation analysis, our study shows that the momentum \emph{helps escape from saddle points}, but \emph{hurts the convergence within the neighborhood of optima} (in the absence of step size annealing). Our theoretical discovery partially corroborates the empirical success of MSGD in training deep neural networks. Moreover, our analysis applies the martingale method and the "Fixed-State-Chain" method from the stochastic approximation literature, which are of independent interest.
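
For readers who want a concrete picture of the update being analyzed, the following is a minimal sketch of momentum SGD in its common heavy-ball form. It is not taken from the paper: the function name `msgd`, the toy objective f(x, y) = (x^2 - y^2)/2 with a strict saddle at the origin, and all parameter values are assumptions chosen for illustration, and the paper's exact parameterization may differ. The noise is switched off in the comparison so the run is deterministic; it only illustrates the escape behavior near a saddle, not the diffusion analysis itself.

```python
import numpy as np

def msgd(grad, x0, lr, momentum, steps, noise_std=0.0, seed=0):
    """Heavy-ball momentum SGD (illustrative sketch, hypothetical helper).

    v_{k+1} = momentum * v_k - lr * (grad(x_k) + noise)
    x_{k+1} = x_k + v_{k+1}
    Returns the array of iterates, including the starting point.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    path = [x.copy()]
    for _ in range(steps):
        # Stochastic gradient: exact gradient plus optional Gaussian noise.
        g = grad(x) + noise_std * rng.standard_normal(x.shape)
        v = momentum * v - lr * g
        x = x + v
        path.append(x.copy())
    return np.array(path)

def saddle_grad(p):
    # Gradient of the assumed toy objective f(x, y) = 0.5 * (x**2 - y**2).
    return np.array([p[0], -p[1]])

start = [1.0, 1e-3]  # small offset along the escape direction y
plain = msgd(saddle_grad, start, lr=0.01, momentum=0.0, steps=100)
heavy = msgd(saddle_grad, start, lr=0.01, momentum=0.9, steps=100)

# Distance travelled along the escape direction after 100 steps.
print("no momentum :", abs(plain[-1, 1]))
print("momentum 0.9:", abs(heavy[-1, 1]))
```

In this toy run the iterate with momentum moves away from the saddle along the y direction noticeably faster than plain gradient descent with the same step size. Setting `noise_std` to a positive value gives a stochastic version of the same experiment.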

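This paper analyzes the behavior of the momentum stochastic gradient descent algorithm on nonconvex optimization problems via a diffusion approximation. It finds that the algorithm helps escape from strict saddle points but hinders convergence within the neighborhood of optima (when neither step size annealing nor momentum annealing is used). These theoretical findings partially corroborate the empirical success of MSGD in training deep neural networks.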

Diffusion Approximation Theory of Momentum SGD in Nonconvex Optimization