Feb, 2020
A Diffusion Theory for Deep Learning Dynamics: Stochastic Gradient Descent Escapes From Sharp Minima Exponentially Fast
Zeke Xie, Issei Sato, Masashi Sugiyama
TL;DR
Using density diffusion theory (DDT), we prove for the first time, both theoretically and empirically, that SGD favors flat minima over GD, and we show that searching for flat minima with large-batch training requires exponentially long time.
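
One standard way to make this diffusion picture concrete (a sketch under common simplifying assumptions, not necessarily the paper's exact derivation) is to write the minibatch gradient as the full gradient plus noise whose covariance shrinks with batch size B, then pass to a stochastic differential equation:

\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t) + \eta \xi_t, \qquad \mathrm{Cov}(\xi_t) = C(\theta_t)/B

d\theta_t = -\nabla L(\theta_t)\, dt + \sqrt{\eta/B}\; C(\theta_t)^{1/2}\, dW_t

Treating \eta/B as an effective temperature, a Kramers-type escape argument gives a mean escape time from a minimum on the order of \tau \propto \exp(c\, B\, \Delta L / \eta) for a loss barrier \Delta L (with c depending on local curvature and noise covariance). This exponential dependence on B is the intuition behind the TL;DR: large-batch or full-batch training needs exponentially long time to leave a minimum, and sharp minima (larger effective barriers relative to noise) trap GD far longer than they trap SGD.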
Abstract
Stochastic optimization algorithms, such as stochastic gradient descent (SGD) and its variants, are mainstream methods for training deep networks in practice. However, the theoretical mechanism behind gradient noise […]
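
As a toy illustration of the claim (my own sketch, not the paper's experiments), the following Python script compares noiseless gradient descent with a noisy SGD surrogate on a hypothetical 1-D loss that has a sharp well at x = -1 and a flat well at x = +1. The landscape, learning rate, and noise scale are all made-up parameters, and the isotropic Gaussian noise is a simplification of real minibatch noise.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D loss: a sharp Gaussian well at x = -1 (curvature ~16)
# and a flat Gaussian well at x = +1 (curvature ~2), with a barrier between.
#   L(x) = 1 - 0.8*exp(-(x+1)^2/0.1) - 1.0*exp(-(x-1)^2/1.0)
def grad(x):
    g_sharp = (0.8 / 0.05) * (x + 1.0) * np.exp(-(x + 1.0) ** 2 / 0.1)
    g_flat = (1.0 / 0.5) * (x - 1.0) * np.exp(-(x - 1.0) ** 2 / 1.0)
    return g_sharp + g_flat

def first_escape(noise_std, steps=20_000, eta=0.05):
    """Step at which the iterate first leaves the sharp basin (x > 0),
    or None if it stays trapped for the whole run."""
    x = -1.0  # start at the bottom of the sharp well
    for t in range(steps):
        noise = noise_std * rng.standard_normal()  # surrogate for minibatch noise
        x -= eta * (grad(x) + noise)
        if x > 0.0:
            return t
    return None

print("GD  escape step:", first_escape(noise_std=0.0))  # None: GD stays trapped
sgd = [first_escape(noise_std=2.5) for _ in range(20)]
escaped = [t for t in sgd if t is not None]
print(f"SGD escaped in {len(escaped)}/20 runs,",
      "median step:", int(np.median(escaped)) if escaped else "n/a")

With the noise switched off, the iterate settles at the sharp well and never leaves; with noise on, escape times are random and heavy-tailed, and raising the effective temperature eta*sigma^2/2 (smaller batch in the SDE picture above) shortens them roughly exponentially, in line with the Kramers-type estimate sketched earlier.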