Old Optimizer, New Norm: An Anthology

Sep, 2024

Jeremy Bernstein, Laker Newhouse

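TL;DR: This work addresses limitations in the theory of deep learning optimizers. It proposes a new understanding of three methods (Adam, Shampoo and Prodigy), arguing that each can be viewed as steepest descent under a particular norm. The authors further suggest that assigning different operator norms to tensors according to their role in the network opens up a new design space for training algorithms, with the aim of improving model stability and training efficiency.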
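To make "steepest descent under a particular norm" concrete, here is a minimal sketch, not the authors' code. It assumes the standard formulation in which the step minimizes a linear model of the loss plus a squared-norm penalty; the function name `steepest_descent_step` and the parameter `sharpness` are hypothetical names chosen for illustration. Under the Euclidean norm the step is ordinary gradient descent; under the infinity norm it is a scaled sign of the gradient, which is the form Adam's update takes when its moving averages are switched off.

```python
def sign(x):
    # Returns -1, 0 or 1 depending on the sign of x.
    return (x > 0) - (x < 0)


def steepest_descent_step(grad, sharpness, norm="l2"):
    """Minimizer of  grad . dw + (sharpness / 2) * ||dw||^2  for the chosen norm.

    Hypothetical helper for illustration; `sharpness` plays the role of an
    inverse step size.
    """
    if norm == "l2":
        # Euclidean norm: the step is plain gradient descent.
        return [-g / sharpness for g in grad]
    if norm == "linf":
        # Infinity norm: the dual norm is the l1 norm, and the step
        # direction is the elementwise sign of the gradient.
        dual = sum(abs(g) for g in grad)
        return [-(dual / sharpness) * sign(g) for g in grad]
    raise ValueError(f"unsupported norm: {norm}")


grad = [0.5, -2.0, 0.0]
l2_step = steepest_descent_step(grad, 1.0, "l2")
linf_step = steepest_descent_step(grad, 1.0, "linf")
print(l2_step)
print(linf_step)
```

The only difference between the two branches is the choice of norm, which is the sense in which changing the norm changes the optimizer.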

Abstract

Deep learning optimizers are often motivated through a mix of convex and approximate second-order theory. We select three such methods (Adam, Shampoo and Prodigy) and argue that each method can instead be understood as a squarely first-order method without convexity assumptions. In