March 2017
How to Escape Saddle Points Efficiently
Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, Michael I. Jordan
TL;DR
This paper shows that perturbed gradient descent converges to second-order stationary points in a small number of iterations, at a rate matching that of gradient descent to first-order stationary points. When all saddle points are non-degenerate, every second-order stationary point is a local minimum, so the result shows that perturbed gradient descent can escape saddle points almost for free, and it applies directly to many machine learning problems, including deep learning.
Abstract
This paper shows that a perturbed form of gradient descent converges to a second-order stationary point in a number of iterations which depends only poly-logarithmically on dimension (i.e., it is almost "dimension-free"). The convergence rate of this procedure matches the well-known convergence rate of gradient descent to first-order stationary points, up to log factors. When all saddle points are non-degenerate, all second-order stationary points are local minima, and this result thus shows that perturbed gradient descent can escape saddle points almost for free.
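The idea behind perturbed gradient descent is simple: run ordinary gradient descent, and whenever the gradient is small (the iterate looks first-order stationary), add noise drawn uniformly from a small ball so the iterate can slide off a strict saddle. Below is a minimal NumPy sketch of that idea; it is not the paper's exact algorithm (which sets the step size, perturbation radius, and thresholds from smoothness and Hessian-Lipschitz constants and adds a function-decrease termination check), and all hyperparameter values and the test function here are illustrative assumptions.

```python
import numpy as np

def perturbed_gradient_descent(grad, x0, eta=0.1, r=1e-2, g_thresh=1e-3,
                               t_thresh=10, max_iters=1000, seed=0):
    """Gradient descent with occasional random perturbations at
    near-stationary points, so strict saddles can be escaped."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    last_perturb = -t_thresh  # iteration of the most recent perturbation
    for t in range(max_iters):
        g = grad(x)
        # Near-zero gradient: either near a local minimum or near a saddle.
        # Perturb uniformly within a ball of radius r (but not too often).
        if np.linalg.norm(g) <= g_thresh and t - last_perturb >= t_thresh:
            xi = rng.normal(size=x.shape)
            xi *= r * rng.uniform() ** (1.0 / x.size) / np.linalg.norm(xi)
            x = x + xi
            last_perturb = t
            g = grad(x)  # re-evaluate the gradient after perturbing
        x = x - eta * g  # standard gradient step
    return x

# Illustrative test (assumed, not from the paper):
# f(x, y) = x**4/4 - x**2/2 + y**2/2 has a strict saddle at the origin and
# minima at (+-1, 0). Plain GD started exactly at the saddle never moves;
# the perturbation lets the iterate escape toward a minimum.
grad_f = lambda z: np.array([z[0]**3 - z[0], z[1]])
print(perturbed_gradient_descent(grad_f, np.zeros(2)))  # approx. [+-1, 0]
```

Started exactly at the saddle, the gradient is zero and an unperturbed iterate would stay put forever; a single perturbation of radius r is enough for the negative-curvature direction to take over, after which plain gradient steps do the rest.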