BriefGPT.xyz
Apr, 2025
Muon Optimizer Accelerates Grokking
Amund Tveit, Bjørn Remseth, Arve Skogvold
TL;DR
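This study examines how different optimizers affect grokking, the phenomenon in which models exhibit delayed generalization. In experiments on seven numerical tasks, the Muon optimizer, which incorporates spectral norm constraints and second-order information, reached grokking significantly faster than the widely used AdamW optimizer, lowering the mean grokking epoch from 153.09 to 102.89. This indicates that the choice of optimizer plays a key role in the transition from memorization to generalization.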
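To put the two reported means in proportion, the following is a minimal sketch of how a grokking epoch might be read off a validation curve and how the relative reduction between the two optimizers can be computed. The function names, the 0.95 accuracy threshold, and the toy validation curve are assumptions made for illustration and are not taken from this summary. Only the two mean values (153.09 for AdamW, 102.89 for Muon) come from the text above.

```python
from typing import Optional, Sequence


def grokking_epoch(val_acc: Sequence[float], threshold: float = 0.95) -> Optional[int]:
    """Return the first 1-indexed epoch whose validation accuracy reaches
    `threshold`, or None if it never does.

    The threshold is an assumed definition of "grokking" for this sketch.
    """
    for epoch, acc in enumerate(val_acc, start=1):
        if acc >= threshold:
            return epoch
    return None


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


# Mean grokking epochs as reported in the TL;DR.
ADAMW_MEAN_EPOCH = 153.09
MUON_MEAN_EPOCH = 102.89

reduction = relative_reduction(ADAMW_MEAN_EPOCH, MUON_MEAN_EPOCH)
print(f"Relative reduction in mean grokking epoch: {reduction:.1%}")

# Hypothetical validation curve: flat for two epochs, then a late jump.
toy_curve = [0.10, 0.12, 0.96, 0.99]
print("Toy grokking epoch:", grokking_epoch(toy_curve))
```

Under these assumptions, the reported means correspond to Muon reaching the threshold in roughly one third fewer epochs than AdamW on average.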
Abstract
This paper investigates the impact of different optimizers on the grokking phenomenon, where models exhibit delayed generalization. We conducted experiments across seven numerical tasks (primarily modular arithme…