In this paper, we study stochastic gradient descent (SGD) for nonconvex statistical optimization problems from a diffusion approximation point of view. Using the theory of large deviations for random dynamical systems, we prove the following in the small-stepsize regime and in the presence of omnidirectional noise: starting from a local minimizer (resp.~saddle point), the SGD iteration escapes in a number of iterations that depends exponentially (resp.~linearly) on the inverse stepsize. We take the deep neural network as an example to study this phenomenon. Based on a new analysis of the mixing rate of multidimensional Ornstein-Uhlenbeck processes, our theory substantiates recent empirical results by \citet{keskar2016large}, which suggest that large batch sizes in synchronous training of deep networks lead to poor generalization error.
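
As a toy illustration of the escape claim above (not the paper's analysis or experiments), the following sketch runs one-dimensional SGD with additive Gaussian gradient noise on two quadratics: f(x) = -x^2/2, whose origin is an unstable stationary point standing in for a saddle direction, and f(x) = x^2/2, whose origin is a minimizer. The function name `escape_steps`, its parameters, and the chosen stepsize, noise level, radius, and step budget are all assumptions made for illustration.

```python
import random


def escape_steps(curvature, eta=0.01, sigma=1.0, radius=1.0,
                 max_steps=5000, seed=0):
    """Run noisy gradient steps on f(x) = curvature * x**2 / 2 from x = 0.

    Returns the first step at which |x| > radius, or None if the iterate
    stays within the radius for max_steps steps.
    """
    rng = random.Random(seed)
    x = 0.0
    for step in range(1, max_steps + 1):
        # Exact gradient curvature * x plus additive Gaussian noise
        # (a stand-in for minibatch noise; an assumption of this sketch).
        noisy_grad = curvature * x + sigma * rng.gauss(0.0, 1.0)
        x -= eta * noisy_grad
        if abs(x) > radius:
            return step
    return None


# Unstable stationary point: noise is amplified and the iterate leaves.
saddle_steps = escape_steps(curvature=-1.0)
# Minimizer: the iterate fluctuates on a scale of about sqrt(eta / 2).
minimum_steps = escape_steps(curvature=1.0)
print("saddle:", saddle_steps, "minimum:", minimum_steps)
```

With these settings the unstable point is left within the step budget, while the iterate near the minimizer stays far inside the radius. The sketch only contrasts the two regimes qualitatively; it does not measure the exponential or linear dependence on the inverse stepsize.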

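This work studies the SGD optimization algorithm for nonconvex optimization problems from the viewpoint of perturbed dynamical systems. It finds that a perturbed process approximates SGD in the weak sense, and that batch size has a marked effect on deep neural networks: small batches help SGD avoid unstable stationary points and sharp minima. The theory also suggests that, for better generalization, the batch size should be increased in the later stage of training so that SGD settles into flat minima.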

Diffusion Approximation of Nonconvex Stochastic Gradient Descent