Feb, 2020
A Diffusion Theory for Deep Learning Dynamics: Stochastic Gradient Descent Escapes From Sharp Minima Exponentially Fast
Zeke Xie, Issei Sato, Masashi Sugiyama
TL;DR
Using density diffusion theory (DDT), we prove for the first time, both theoretically and empirically, that SGD favors flat minima over GD, and we show that searching for flat minima with large-batch training requires exponentially long time.
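
One standard way to make this diffusion picture concrete (a sketch under common simplifying assumptions, not necessarily the paper's exact derivation) is to write the minibatch gradient as the full gradient plus noise whose covariance shrinks with batch size B, then pass to a stochastic differential equation:

\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t) + \eta \xi_t, \qquad \mathrm{Cov}(\xi_t) = C(\theta_t)/B

d\theta_t = -\nabla L(\theta_t)\, dt + \sqrt{\eta/B}\; C(\theta_t)^{1/2}\, dW_t

Treating \eta/B as an effective temperature, a Kramers-type escape argument gives a mean escape time from a minimum on the order of \tau \propto \exp(c\, B\, \Delta L / \eta) for a loss barrier \Delta L (with c depending on local curvature and noise covariance). This exponential dependence on B is the intuition behind the TL;DR: large-batch or full-batch training needs exponentially long time to leave a minimum, and sharp minima (larger effective barriers relative to noise) trap GD far longer than they trap SGD.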
Abstract
Stochastic optimization algorithms, such as stochastic gradient descent (SGD) and its variants, are mainstream methods for training deep networks in practice. However, the theoretical mechanism behind gradient noise […]
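
As a toy illustration of the claim (my own sketch, not the paper's experiments), the following Python script compares noiseless gradient descent with a noisy SGD surrogate on a hypothetical 1-D loss that has a sharp well at x = -1 and a flat well at x = +1. The landscape, learning rate, and noise scale are all made-up parameters, and the isotropic Gaussian noise is a simplification of real minibatch noise.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D loss: a sharp Gaussian well at x = -1 (curvature ~16)
# and a flat Gaussian well at x = +1 (curvature ~2), with a barrier between.
#   L(x) = 1 - 0.8*exp(-(x+1)^2/0.1) - 1.0*exp(-(x-1)^2/1.0)
def grad(x):
    g_sharp = (0.8 / 0.05) * (x + 1.0) * np.exp(-(x + 1.0) ** 2 / 0.1)
    g_flat = (1.0 / 0.5) * (x - 1.0) * np.exp(-(x - 1.0) ** 2 / 1.0)
    return g_sharp + g_flat

def first_escape(noise_std, steps=20_000, eta=0.05):
    """Step at which the iterate first leaves the sharp basin (x > 0),
    or None if it stays trapped for the whole run."""
    x = -1.0  # start at the bottom of the sharp well
    for t in range(steps):
        noise = noise_std * rng.standard_normal()  # surrogate for minibatch noise
        x -= eta * (grad(x) + noise)
        if x > 0.0:
            return t
    return None

print("GD  escape step:", first_escape(noise_std=0.0))  # None: GD stays trapped
sgd = [first_escape(noise_std=2.5) for _ in range(20)]
escaped = [t for t in sgd if t is not None]
print(f"SGD escaped in {len(escaped)}/20 runs,",
      "median step:", int(np.median(escaped)) if escaped else "n/a")

With the noise switched off, the iterate settles at the sharp well and never leaves; with noise on, escape times are random and heavy-tailed, and raising the effective temperature eta*sigma^2/2 (smaller batch in the SDE picture above) shortens them roughly exponentially, in line with the Kramers-type estimate sketched earlier.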