Stochastic gradient descent (SGD) still is the workhorse for many practical problems. However, it converges slow, and can be difficult to tune. It is possible to precondition SGD to accelerate its convergence remarkably. But many attempts in this direction either aim at solving specialized problems, or result in significantly more complicated methods than SGD. This paper proposes a new way to estimate a preconditioner by equalizing the amplitudes of parameter changes and the amplitudes of associated gradient changes. Unlike the Hessian inverse like preconditioners based on secant equation fitting as done in deterministic quasi-Newton methods, which work the best when the Hessian is positive definite, the new preconditioner works equally well for both convex and non-convex optimizations. When stochastic gradient is used, it can naturally damp the gradient noise to stabilize SGD. Efficient preconditioner estimation methods are developed, and with reasonable simplifications, they are applicable to large scaled problems. Experimental results demonstrate that equipped with the new preconditioner, without any tuning effort, preconditioned SGD can efficiently solve many challenging problems like the training of a deep neural network or a recurrent neural network requiring extremely long term memories.

本文提出了一种新的方法，通过估计一个预条件器来加速随机梯度下降算法的收敛速度，适用于凸性和非凸性优化，具有稳定梯度降噪的效果，并且经过了大规模问题的有效预条件估计验证，可以在无需调整的情况下，高效解决深度神经网络等复杂问题

预处理随机梯度下降