Very little is known about the training dynamics of adaptive gradient methods like Adam in deep learning. In this paper, we shed light on the behavior of these algorithms in the full-batch and sufficiently large batch settings. Specifically, we empirically demonstrate that during full-batch training, the maximum eigenvalue of the preconditioned Hessian typically equilibrates at a certain numerical value -- the stability threshold of a gradient descent algorithm. For Adam with step size $\eta$ and $\beta_1 = 0.9$, this stability threshold is $38/\eta$. Similar effects occur during minibatch training, especially as the batch size grows. Yet, even though adaptive methods train at the ``Adaptive Edge of Stability'' (AEoS), their behavior in this regime differs in a significant way from that of non-adaptive methods at the EoS. Whereas non-adaptive algorithms at the EoS are blocked from entering high-curvature regions of the loss landscape, adaptive gradient methods at the AEoS can keep advancing into high-curvature regions, while adapting the preconditioner to compensate. Our findings can serve as a foundation for the community's future understanding of adaptive gradient methods in deep learning.
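The $38/\eta$ figure quoted above can be checked with a small numerical sketch. It assumes the standard stability analysis of gradient descent with exponential-moving-average momentum on a one-dimensional quadratic, which gives a threshold of $(2 + 2\beta_1)/((1 - \beta_1)\eta)$. It further assumes a frozen, identity preconditioner, so the preconditioned curvature equals the raw curvature. The function names `ema_momentum_threshold` and `run_on_quadratic` are hypothetical and serve only as illustration; this is not the paper's experimental code.

```python
def ema_momentum_threshold(eta: float, beta1: float) -> float:
    """Curvature above which EMA-momentum gradient descent diverges on a quadratic.

    Assumes a fixed (identity) preconditioner; for beta1 = 0.9 this is 38 / eta.
    """
    return (2.0 + 2.0 * beta1) / ((1.0 - beta1) * eta)


def run_on_quadratic(curvature: float, eta: float, beta1: float,
                     steps: int = 200, x0: float = 1.0) -> float:
    """Run EMA-momentum GD on f(x) = 0.5 * curvature * x**2 and return the final |x|."""
    x, m = x0, 0.0
    for _ in range(steps):
        grad = curvature * x                   # gradient of the quadratic
        m = beta1 * m + (1.0 - beta1) * grad   # EMA of gradients (Adam-style first moment)
        x = x - eta * m                        # parameter update with step size eta
    return abs(x)


if __name__ == "__main__":
    eta, beta1 = 0.01, 0.9
    threshold = ema_momentum_threshold(eta, beta1)
    print("threshold:", threshold)
    print("below:", run_on_quadratic(0.9 * threshold, eta, beta1))
    print("above:", run_on_quadratic(1.1 * threshold, eta, beta1))
```

Under these assumptions, the iterate shrinks toward zero when the curvature sits slightly below the threshold and grows without bound when it sits slightly above. This is the sense in which the threshold marks an edge of stability.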

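Little is known about the training dynamics of adaptive gradient methods such as Adam in deep learning, and this paper sheds light on them. It finds that in the full-batch and sufficiently large batch settings, the maximum eigenvalue of the preconditioned Hessian typically settles at a specific value, the stability threshold of a gradient descent algorithm. Moreover, although adaptive methods also train at the edge of stability, they behave differently from non-adaptive methods: they can keep advancing into high-curvature regions while adapting the preconditioner to compensate.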

Adaptive Gradient Methods at the Edge of Stability