The choice of batch sizes in stochastic gradient optimizers is critical for
model training. However, the practice of varying batch sizes throughout the
training process is less explored than the tuning of other hyperparameters. We
investigate adaptive batch size strategies derived from adaptive sampling
methods, traditionally applied only in stochastic gradient descent. Given the
significant interplay between learning rates and batch sizes, and considering
the prevalence of adaptive gradient methods in deep learning, we emphasize the
need for adaptive batch size strategies in these contexts. We introduce
AdAdaGrad and its scalar variant AdAdaGradNorm, which incrementally increase
batch sizes during training, while model updates are performed using AdaGrad
and AdaGradNorm. We prove that AdaGradNorm converges with high probability at a
rate of $\mathscr{O}(1/K)$ for finding a first-order stationary point of smooth
nonconvex functions within $K$ iterations. AdaGrad also demonstrates similar
convergence properties when integrated with a novel coordinate-wise variant of
our adaptive batch size strategies. Our theoretical claims are supported by
numerical experiments on various image classification tasks, highlighting the
enhanced adaptability of progressive batching protocols in deep learning and
the potential of such adaptive batch size strategies with adaptive gradient
optimizers in large-scale model training.
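
The idea of growing the batch size while stepping with AdaGradNorm can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' algorithm: it assumes the adaptive sampling rule is a norm test (grow the batch when the sample variance of per-sample gradients is large relative to the squared norm of the batch gradient), and the names `per_sample_grads`, `norm_test`, `adadagrad_norm_sketch`, as well as the values of `theta`, `eta`, and `b0`, are hypothetical choices on a toy least-squares problem.

```python
import math
import random

def per_sample_grads(w, batch):
    """Per-sample gradients of 0.5 * (w.x - y)^2 for each (x, y) in the batch."""
    grads = []
    for x, y in batch:
        r = sum(wi * xi for wi, xi in zip(w, x)) - y
        grads.append([r * xi for xi in x])
    return grads

def norm_test(grads, theta):
    """Return (batch gradient, suggested batch size). Assumed norm-test rule."""
    n, d = len(grads), len(grads[0])
    mean = [sum(g[j] for g in grads) / n for j in range(d)]
    mean_sq = sum(m * m for m in mean)
    # Sum over coordinates of the unbiased per-sample variance.
    var = sum((g[j] - mean[j]) ** 2 for g in grads for j in range(d)) / max(n - 1, 1)
    denom = theta ** 2 * mean_sq
    if denom == 0.0:
        return mean, n
    # The suggested size never drops below the current one.
    return mean, max(n, math.ceil(var / denom))

def adadagrad_norm_sketch(data, w0, eta=1.0, b0=1e-2, theta=0.5,
                          batch0=4, steps=200, seed=0):
    rng = random.Random(seed)
    w, acc, batch_size = list(w0), b0 ** 2, batch0
    sizes = []
    for _ in range(steps):
        batch = rng.sample(data, min(batch_size, len(data)))
        g, suggested = norm_test(per_sample_grads(w, batch), theta)
        acc += sum(gj * gj for gj in g)      # AdaGradNorm accumulator b_k^2
        step = eta / math.sqrt(acc)          # one scalar step size for all coordinates
        w = [wj - step * gj for wj, gj in zip(w, g)]
        sizes.append(len(batch))
        batch_size = min(suggested, len(data))
    return w, sizes

# Usage on synthetic least-squares data (all values are illustrative).
rng = random.Random(1)
data = []
for _ in range(256):
    x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    data.append((x, 2.0 * x[0] - 1.0 * x[1] + rng.gauss(0, 0.1)))
w, sizes = adadagrad_norm_sketch(data, [0.0, 0.0])
print(w, sizes[0], sizes[-1])
```

In this sketch the batch size is non-decreasing by construction and is capped at the dataset size; a coordinate-wise variant would apply the accumulator and the test per coordinate instead of to the full gradient norm.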

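Using adaptive batch size strategies, the paper introduces AdAdaGrad and AdAdaGradNorm, demonstrating the enhanced adaptability of progressive batching protocols in deep learning and the potential of combining adaptive batch size strategies with adaptive gradient optimizers.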

AdAdaGrad: Adaptive Batch Size Schemes for Adaptive Gradient Methods

Adaptive gradient optimizers like Adam(W) are the default training algorithms
for many deep learning architectures, such as transformers. Their diagonal
preconditioner is based on the gradient outer product, which is incorporated
into the parameter update via a square root. While these methods are often
motivated as approximate second-order methods, the square root represents a
fundamental difference. In this work, we investigate how the behavior of
adaptive methods changes when we remove the root, i.e., strengthen their
second-order motivation. Surprisingly, we find that such square-root-free
adaptive methods close the generalization gap to SGD on convolutional
architectures, while maintaining their root-based counterpart's performance on
transformers. The second-order perspective also has practical benefits for the
development of adaptive methods with non-diagonal preconditioners. In contrast
to root-based counterparts like Shampoo, square-root-free methods do not
require numerically unstable matrix square roots and therefore work well in low
precision, which we demonstrate empirically. This raises important questions regarding the
currently overlooked role of adaptivity for the success of adaptive methods.
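
The difference between a root-based and a square-root-free diagonal update can be illustrated with a small sketch. It assumes a plain RMSProp-style exponential moving average of squared gradients as the diagonal preconditioner; momentum, bias correction, and any other details of the paper's actual method are omitted, and `adaptive_step` with its parameters is a hypothetical name introduced here.

```python
import math

def adaptive_step(w, g, v, lr, beta2=0.999, eps=1e-8, use_root=True):
    """One diagonal adaptive update; returns (new_w, new_v)."""
    # Exponential moving average of squared gradients (diagonal preconditioner).
    new_v = [beta2 * vj + (1.0 - beta2) * gj * gj for vj, gj in zip(v, g)]
    if use_root:
        denom = [math.sqrt(vj) + eps for vj in new_v]   # root-based (RMSProp/Adam style)
    else:
        denom = [vj + eps for vj in new_v]              # square-root-free
    new_w = [wj - lr * gj / dj for wj, gj, dj in zip(w, g, denom)]
    return new_w, new_v

# With beta2=0 the preconditioner is just g^2, which makes the contrast visible:
# doubling the gradient leaves the root-based step unchanged and halves the
# square-root-free step.
for scale in (2.0, 4.0):
    w_root, _ = adaptive_step([0.0], [scale], [0.0], lr=1.0, beta2=0.0, eps=0.0)
    w_free, _ = adaptive_step([0.0], [scale], [0.0], lr=1.0, beta2=0.0, eps=0.0,
                              use_root=False)
    print(scale, w_root, w_free)
```

The root-based step is invariant to rescaling the gradient, whereas the square-root-free step shrinks in inverse proportion to the gradient scale, so the two variants generally need different learning rates.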

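Square-root-free adaptive methods close the generalization gap to SGD on convolutional architectures while maintaining their root-based counterparts' performance on transformers. The second-order perspective also supports the development of adaptive methods with non-diagonal preconditioners, which do not require numerically unstable matrix square roots and work well in low precision.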

Can We Remove the Square-Root in Adaptive Gradient Methods? A Second-Order Perspective

The grokking phenomenon as reported by Power et al. (arXiv:2201.02177)
refers to a regime where a long period of overfitting is followed by a
seemingly sudden transition to perfect generalization. In this paper, we
attempt to reveal the underpinnings of Grokking via a series of empirical
studies. Specifically, we uncover an optimization anomaly plaguing adaptive
optimizers at extremely late stages of training, referred to as the Slingshot
Mechanism. A prominent artifact of the Slingshot Mechanism can be measured by
the cyclic phase transitions between stable and unstable training regimes, and
can be easily monitored by the cyclic behavior of the norm of the last layer's
weights. We empirically observe that without explicit regularization, Grokking
as reported in (arXiv:2201.02177) almost exclusively happens at the onset of
Slingshots, and is absent without them. While common and easily reproduced in
more general settings, the Slingshot Mechanism does not follow from any known
optimization theories that we are aware of, and can be easily overlooked
without an in-depth examination. Our work points to a surprising and useful
inductive bias of adaptive gradient optimizers at late stages of training,
calling for a revised theoretical analysis of their origin.
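
Monitoring the norm of the last layer's weights, as described above, can be sketched in a few lines. The Frobenius norm is the quantity being tracked; the jump heuristic in `flag_norm_jumps` and its `rel_threshold` are assumptions made for illustration only and are not taken from the paper.

```python
import math

def weight_norm(weights):
    """Frobenius norm of a weight matrix given as a list of rows."""
    return math.sqrt(sum(w * w for row in weights for w in row))

def flag_norm_jumps(norms, rel_threshold=0.05):
    """Indices where the norm grows by more than rel_threshold relative to the
    previous logged value (hypothetical heuristic for spotting abrupt growth)."""
    return [k for k in range(1, len(norms))
            if norms[k - 1] > 0
            and (norms[k] - norms[k - 1]) / norms[k - 1] > rel_threshold]

# Usage: log weight_norm(last_layer) at a fixed interval during training,
# then inspect the logged sequence for repeated abrupt increases.
logged = [1.0, 1.01, 1.2, 1.21, 1.5]
print(flag_norm_jumps(logged))
```

A plot of the logged norms against training steps is usually more informative than any single threshold; the heuristic above only marks candidate steps to inspect.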

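This paper aims to reveal the underpinnings of the Grokking phenomenon through a series of empirical studies. It uncovers an optimization anomaly of adaptive optimizers, called the Slingshot Mechanism, and observes that Grokking almost exclusively happens at the onset of Slingshots.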