Dec, 2017
Improving Generalization Performance by Switching from Adam to SGD
Nitish Shirish Keskar, Richard Socher
TL;DR
The paper proposes SWATS, a hybrid training strategy that begins with the adaptive method Adam and switches to SGD in the later stages of training once a triggering condition is met. Experiments show that SWATS can close the generalization gap between adaptive methods and SGD, performing well on a majority of tasks.
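To make the hybrid idea concrete, below is a minimal PyTorch sketch of switching the optimizer from Adam to SGD mid-training. It is only an illustration, not the paper's SWATS implementation: the toy model, data, learning rates, and the fixed switch epoch are placeholder assumptions, whereas SWATS derives its switch point (and the SGD learning rate) from a monitored triggering condition on Adam's updates.

```python
# Minimal sketch of the Adam-to-SGD switching idea (NOT the paper's SWATS rule).
# The model, data, learning rates, and SWITCH_EPOCH below are hypothetical
# placeholders chosen only to make the example runnable.
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                                   # toy model
data = [(torch.randn(32, 10), torch.randn(32, 1)) for _ in range(100)]
loss_fn = nn.MSELoss()

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # start adaptively
switched = False
SWITCH_EPOCH = 5   # stand-in trigger; SWATS instead monitors a condition on Adam's steps

for epoch in range(10):
    for x, y in data:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()

    if not switched and epoch >= SWITCH_EPOCH:
        # Hand the same parameter tensors to SGD and continue training from there.
        optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
        switched = True
```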
Abstract
Despite superior training outcomes, adaptive optimization methods such as Adam, Adagrad or RMSprop have been found to generalize poorly compared to stochastic gradient descent (SGD). These methods tend to perform well in the initial portion of training but are outperformed by SGD at later stages of training.