Jun, 2020
On the Generalization Benefit of Noise in Stochastic Gradient Descent
Samuel L. Smith, Erich Elsen, Soham De
TL;DR
The study shows that, for the same number of training iterations, small or moderately sized batches achieve better test-set performance than very large batches; it also examines how the optimal learning rate schedule should change as the training budget grows, and provides a theoretical explanation of SGD dynamics based on stochastic differential equations.
Abstract
It has long been argued that minibatch stochastic gradient descent can generalize better than large batch gradient descent in deep neural networks. However recent papers have questioned this claim, arguing that this effect is simply a consequence of suboptimal hyperparameter tuning or insufficient compute budgets when the batch size is large. In this paper, we perform carefully designed experiments and rigorous hyperparameter sweeps on a range of popular models, which verify that small or moderately large batch sizes can substantially outperform very large batches on the test set. This occurs even when both models are trained for the same number of iterations and large batches achieve smaller training losses. Our results confirm that the noise in stochastic gradients can enhance generalization. We study how the optimal learning rate schedule changes as the epoch budget grows, and provide a theoretical account of our observations based on the stochastic differential equation perspective of SGD dynamics.
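For reference, the stochastic differential equation perspective mentioned in the abstract is commonly formalized as follows (this is the standard formulation from the SDE-approximation literature, sketched here for context, not text from the paper itself). Writing the SGD update as \theta_{k+1} = \theta_k - \epsilon\,\hat{g}(\theta_k), where \hat{g} is a minibatch gradient estimate with covariance \Sigma(\theta)/B for batch size B, the dynamics for small learning rate \epsilon are modeled by

d\theta = -\nabla L(\theta)\,dt + \sqrt{\epsilon/B}\;\Sigma(\theta)^{1/2}\,dW_t.

Under this model the noise magnitude is governed by the ratio \epsilon/B, which is why, at a fixed iteration count, smaller batches (or larger learning rates) inject more gradient noise into training, the proposed source of the generalization benefit.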