Parameter-specific adaptive learning rate methods are computationally efficient ways to reduce the ill-conditioning problems encountered when training large deep networks. Following recent work that strongly suggests that most of the critical points encountered when training such networks are saddle points, we find how considering the presence of negative eigenvalues of the Hessian could help us design better suited adaptive learning rate schemes, i.e., diagonal preconditioners. We show that the optimal preconditioner is based on taking the absolute value of the Hessian's eigenvalues, which is not what Newton and classical preconditioners like Jacobi's do. In this paper, we propose a novel adaptive learning rate scheme based on the equilibration preconditioner and show that RMSProp approximates it, which may explain some of its success in the presence of saddle points. Whereas RMSProp is a biased estimator of the equilibration preconditioner, the proposed stochastic estimator, ESGD, is unbiased and only adds a small percentage to computing time. We find that both schemes yield very similar step directions but that ESGD sometimes surpasses RMSProp in terms of convergence speed, always clearly improving over plain stochastic gradient descent.

该论文提出了一种基于 equilibration preconditioner 的新型自适应学习率方法：ESGD，与 RMSProp 相比收敛速度更快，在非凸问题上表现更好。

非凸优化的平衡自适应学习率