Adaptive gradient algorithms have been widely adopted in training large-scale
deep neural networks, especially large foundation models. Despite their huge
success in practice, their theoretical advantages over stochastic gradient
descent (SGD) have not been fully understood, especially in the large
batch-size setting commonly used in practice. This is because the only
theoretical result that can demonstrate the benefit of Adagrad over SGD was
obtained in the original paper of Adagrad for nonsmooth objective functions.
However, for nonsmooth objective functions, convergence can slow down linearly
as the batch size increases, and thus a convergence analysis based on a
nonsmoothness assumption cannot be used for large-batch algorithms. In this work,
we resolve this gap between theory and practice by providing a new analysis of
Adagrad on both convex and nonconvex smooth objectives suitable for the
large-batch setting. We show that, under anisotropic smoothness and noise
conditions, increasing the batch size does not slow down convergence for Adagrad,
and thus it can still achieve a faster convergence guarantee than SGD even in
the large-batch setting. We present detailed comparisons between SGD and
Adagrad to provide a better understanding of the benefits of adaptive gradient
methods. Experiments on logistic regression and instruction-following
fine-tuning tasks provide strong evidence to support our theoretical analysis.

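By providing a new analysis of Adagrad in the large-batch setting, this work shows that increasing the batch size does not slow its convergence on either convex or nonconvex smooth objectives, so Adagrad can still converge faster than SGD in the large-batch setting, which resolves the gap between theory and practice.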

Large Batch Analysis for Adagrad Under Anisotropic Smoothness
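
As a rough illustration of the comparison described in this abstract, the sketch below contrasts a plain SGD step with a diagonal Adagrad step on a toy quadratic whose curvature differs sharply across coordinates. The objective, the curvature vector `h`, the Gaussian model of minibatch noise (standard deviation `sigma / sqrt(batch_size)`), and all function names are assumptions made for illustration. None of them come from the paper, and the sketch does not reproduce its analysis or experiments.

```python
import numpy as np


def sgd_step(x, grad, lr):
    """Plain SGD: one global step size shared by every coordinate."""
    return x - lr * grad


def adagrad_step(x, grad, accum, lr, eps=1e-8):
    """Diagonal Adagrad: per-coordinate steps from accumulated squared gradients."""
    accum = accum + grad ** 2
    return x - lr * grad / (np.sqrt(accum) + eps), accum


def noisy_grad(x, h, batch_size, sigma, rng):
    """Gradient of f(x) = 0.5 * sum(h * x**2) plus batch-size-dependent noise.

    Assumption: averaging `batch_size` i.i.d. samples scales the noise
    standard deviation by 1 / sqrt(batch_size).
    """
    noise = rng.normal(size=x.shape) * sigma / np.sqrt(batch_size)
    return h * x + noise


def run(optimizer, lr, steps=200, batch_size=64, sigma=1.0, seed=0):
    """Run `steps` iterations and return the final objective value."""
    rng = np.random.default_rng(seed)
    h = np.array([100.0, 1.0])   # anisotropic curvature (hypothetical values)
    x = np.array([1.0, 1.0])
    accum = np.zeros_like(x)
    for _ in range(steps):
        g = noisy_grad(x, h, batch_size, sigma, rng)
        if optimizer == "adagrad":
            x, accum = adagrad_step(x, g, accum, lr)
        else:
            # On this quadratic, SGD is only stable for lr < 2 / max(h).
            x = sgd_step(x, g, lr)
    return 0.5 * np.sum(h * x ** 2)
```

Calling `run("sgd", lr=0.01)` and `run("adagrad", lr=0.5)` with different `batch_size` values gives a quick way to see how each method reacts to the batch size on this toy problem. The single SGD step size is capped by the steepest coordinate, while Adagrad rescales each coordinate separately.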

Adaptive gradient algorithms perform gradient-based updates using the history
of gradients and are ubiquitous in training deep neural networks. While the
theory of adaptive gradient methods is well understood for minimization problems,
the underlying factors driving their empirical success in min-max problems such
as GANs remain unclear. In this paper, we aim at bridging this gap from both
theoretical and empirical perspectives. First, we analyze a variant of
Optimistic Stochastic Gradient (OSG) proposed in~\citep{daskalakis2017training}
for solving a class of non-convex non-concave min-max problems and establish
$O(\epsilon^{-4})$ complexity for finding an $\epsilon$-first-order stationary
point. The algorithm invokes only one stochastic first-order oracle per
iteration while matching the state-of-the-art iteration complexity achieved by
the stochastic extragradient method of~\citep{iusem2017extragradient}. Then we
propose an adaptive variant of OSG named Optimistic Adagrad (OAdagrad) and
reveal an \emph{improved} adaptive complexity
$O\left(\epsilon^{-\frac{2}{1-\alpha}}\right)$, where $\alpha$ characterizes
the growth rate of the cumulative stochastic gradient and $0\leq \alpha\leq
1/2$. To the best of our knowledge, this is the first work to establish
adaptive complexity in non-convex non-concave min-max optimization.
Empirically, our experiments show that adaptive gradient algorithms indeed
outperform their non-adaptive counterparts in GAN training. Moreover, this
observation can be explained by the slow growth rate of the cumulative
stochastic gradient, as observed empirically.

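This paper analyzes, from both theoretical and empirical perspectives, the performance of adaptive gradient algorithms on non-convex non-concave min-max problems. It proposes an adaptive variant named Optimistic Adagrad (OAdagrad), establishes an adaptive complexity for non-convex non-concave min-max optimization, and shows superior performance in GAN training.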

Towards Better Understanding of Adaptive Gradient Algorithms in Generative Adversarial Nets
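
To make the single-oracle idea in this abstract concrete, the following is a minimal sketch of an optimistic update with optional Adagrad-style per-coordinate scaling, run on the toy bilinear game f(x, y) = x * y. The toy game, the scaling rule `delta + sqrt(accum)`, the default hyperparameters, and all names are assumptions made for illustration. The exact OAdagrad update analyzed in the paper may differ.

```python
import numpy as np


def bilinear_operator(w):
    """F(w) = (df/dx, -df/dy) for the toy game f(x, y) = x * y."""
    x, y = w
    return np.array([y, -x])


def optimistic_adagrad(operator, w0, lr=0.1, steps=2000, delta=1.0, adaptive=True):
    """Optimistic update with optional Adagrad-style per-coordinate scaling.

    Each iteration evaluates `operator` once, at the look-ahead point `z`,
    and reuses the previous evaluation to compute the look-ahead itself.
    """
    w = np.asarray(w0, dtype=float)
    g_prev = np.zeros_like(w)   # no gradient history before the first step
    accum = np.zeros_like(w)    # running sum of squared gradients
    for _ in range(steps):
        # Assumption: scale by delta + sqrt(accumulated squared gradients).
        scale = (delta + np.sqrt(accum)) if adaptive else np.ones_like(w)
        z = w - lr * g_prev / scale   # look-ahead using the stale gradient
        g = operator(z)               # the only oracle call in this iteration
        w = w - lr * g / scale        # actual update from the same base point
        accum += g ** 2
        g_prev = g
    return w
```

For example, `optimistic_adagrad(bilinear_operator, [1.0, 1.0], adaptive=False)` runs the non-adaptive optimistic update, and dropping `adaptive=False` switches on the per-coordinate scaling. An extragradient step would instead call `operator` twice per iteration, once for the look-ahead and once for the update.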

L$_2$ regularization and weight decay regularization are equivalent for
standard stochastic gradient descent (when rescaled by the learning rate), but,
as we demonstrate, this is \emph{not} the case for adaptive gradient algorithms
such as Adam. While common implementations of these algorithms employ L$_2$
regularization (often calling it "weight decay", which may be misleading given
the inequivalence we expose), we propose a simple modification to recover
the original formulation of weight decay regularization by \emph{decoupling}
the weight decay from the optimization steps taken w.r.t. the loss function. We
provide empirical evidence that our proposed modification (i) decouples the
optimal choice of weight decay factor from the setting of the learning rate for
both standard SGD and Adam and (ii) substantially improves Adam's
generalization performance, allowing it to compete with SGD with momentum on
image classification datasets (on which it was previously typically
outperformed by the latter). Our proposed decoupled weight decay has already
been adopted by many researchers, and the community has implemented it in
TensorFlow and PyTorch; the complete source code for our experiments is
available at this https URL

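L$_2$ regularization and weight decay regularization are equivalent for standard stochastic gradient descent, but not for adaptive gradient algorithms such as Adam. This paper proposes a simple modification that recovers the original formulation of weight decay by decoupling the weight decay from the optimization steps taken with respect to the loss function. Empirical evidence shows that the modification not only decouples the optimal choice of the weight decay factor from the learning rate setting for both standard SGD and Adam, but also substantially improves Adam's generalization performance, allowing it to compete with SGD with momentum on image classification datasets.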
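
The difference between the two regularizers can be sketched as a single Adam step with a switch between them. For plain SGD the two coincide, since `w - lr * (g + wd * w)` equals `w - lr * g - lr * wd * w`. In Adam, the L$_2$ term is rescaled by the adaptive denominator, whereas the decoupled term is not. The function name, the default hyperparameters, and the choice to scale the decay by `lr` are assumptions made for illustration, not the authors' reference implementation.

```python
import numpy as np


def adam_step(w, grad, m, v, t, lr=0.01, wd=0.1, decoupled=True,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step with either L2 regularization or decoupled weight decay."""
    if not decoupled:
        # L2 regularization: the penalty gradient wd * w passes through
        # Adam's moment estimates and is rescaled like any other gradient.
        grad = grad + wd * w
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)   # bias correction; t starts at 1
    v_hat = v / (1 - beta2 ** t)
    new_w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    if decoupled:
        # Decoupled weight decay: shrink the weights directly,
        # outside the adaptive rescaling.
        # Assumption: the decay is multiplied by lr.
        new_w = new_w - lr * wd * w
    return new_w, m, v
```

With a zero loss gradient on the first step, the L$_2$ variant moves each nonzero weight by roughly `lr` regardless of the size of `wd`, whereas the decoupled variant shrinks it by the factor `1 - lr * wd`.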