Second-order stochastic optimizers allow parameter update step size and
direction to adapt to loss curvature, but have traditionally required too much
memory and compute for deep learning. Recently, Shampoo [Gupta et al., 2018]
introduced a Kronecker-factored preconditioner to reduce these requirements: it
is used for large deep models [Anil et al., 2020] and in production [Anil et
al., 2022]. However, it takes inverse matrix roots of ill-conditioned matrices.
This requires 64-bit precision, imposing strong hardware constraints. In this
paper, we propose a novel factorization, Kronecker Approximation-Domination
(KrAD). Using KrAD, we update a matrix that directly approximates the inverse
empirical Fisher matrix (like full matrix AdaGrad), avoiding inversion and
hence 64-bit precision. We then propose KrADagrad$^\star$, with similar
computational costs to Shampoo and the same regret. Synthetic ill-conditioned
experiments show improved performance over Shampoo for 32-bit precision, while
for several real datasets we have comparable or better generalization.
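
To make the memory argument concrete, the following sketch counts the stored
preconditioner entries for a single m x n weight matrix. It compares a
full-matrix preconditioner (as in full matrix AdaGrad) with a pair of Kronecker
factors of sizes m x m and n x n (as in Shampoo). The function name and the
example layer size are illustrative assumptions, not taken from the paper.

```python
def preconditioner_entries(m, n):
    """Count stored preconditioner entries for one m x n weight matrix.

    Hypothetical helper for illustration only.
    """
    # A full-matrix preconditioner acts on the flattened gradient of length m*n,
    # so it is an (m*n) x (m*n) matrix.
    full_matrix = (m * n) ** 2
    # A Kronecker-factored preconditioner keeps one m x m and one n x n factor.
    kronecker_factored = m * m + n * n
    return {"full_matrix": full_matrix, "kronecker_factored": kronecker_factored}


# Example with an assumed 1024 x 1024 layer.
counts = preconditioner_entries(1024, 1024)
print(counts["full_matrix"])         # 1099511627776
print(counts["kronecker_factored"])  # 2097152
```

The count covers storage only; it says nothing about the cost or numerical
precision of the inverse matrix roots that the abstract identifies as the
remaining bottleneck.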

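This paper proposes a novel matrix factorization, Kronecker Approximation-Domination (KrAD), that directly approximates the inverse of the empirical Fisher matrix, avoiding matrix inversion and 64-bit precision. The resulting optimizer has computational costs similar to Shampoo and the same regret, and it outperforms Shampoo at 32-bit precision.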

KrADagrad: Kronecker Approximation-Domination Gradient Preconditioned Stochastic Optimization

Large-batch training has been essential in leveraging large-scale datasets
and models in deep learning. While it is computationally beneficial to use
large batch sizes, it often requires a specially designed learning rate (LR)
schedule to achieve a comparable level of performance as in smaller batch
training. In particular, when the number of training epochs is constrained, the
use of a large LR and a warmup strategy is critical to the final performance of
large-batch training due to the reduced number of update steps. In this work,
we propose an automated LR scheduling algorithm which is effective for neural
network training with a large batch size under a given epoch budget.
Specifically, the schedule consists of two phases, adaptive warmup and
predefined decay: the LR is increased until the training loss no longer
decreases and is then decreased to zero by the end of training. Here, whether the
training loss has reached the minimum value is robustly checked with Gaussian
process smoothing in an online manner with a low computational burden. Coupled
with adaptive stochastic optimizers such as AdamP and LAMB, the proposed
scheduler successfully adjusts the LRs without cumbersome hyperparameter tuning
and achieves comparable or better performance than tuned baselines on various
image classification benchmarks and architectures with a wide range of batch
sizes.
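
The two-phase schedule described above can be sketched as follows. This is a
minimal illustration under stated assumptions, not the authors' algorithm: an
exponential moving average stands in for the Gaussian process smoothing, the
decay is assumed to be a cosine curve, and the class name and all
hyperparameters (`init_lr`, `growth`, `ema_beta`, `patience`) are hypothetical.

```python
import math


class TwoPhaseLRSchedule:
    """Warm up until the smoothed loss stops decreasing, then decay to zero."""

    def __init__(self, total_steps, init_lr=1e-4, growth=1.2, ema_beta=0.5, patience=2):
        self.total_steps = total_steps
        self.lr = init_lr
        self.growth = growth        # multiplicative LR increase per warmup step (assumed)
        self.ema_beta = ema_beta    # EMA smoothing factor (stand-in for GP smoothing)
        self.patience = patience    # non-improving steps tolerated before decay starts
        self.phase = "warmup"
        self.step_count = 0
        self.smoothed = None
        self.best = math.inf
        self.stalled = 0
        self.peak_lr = None
        self.decay_start = None

    def step(self, loss):
        """Record the latest training loss and return the LR for the next step."""
        self.step_count += 1
        if self.phase == "warmup":
            # Smooth the noisy training loss online.
            if self.smoothed is None:
                self.smoothed = loss
            else:
                self.smoothed = self.ema_beta * self.smoothed + (1 - self.ema_beta) * loss
            if self.smoothed < self.best:
                self.best = self.smoothed
                self.stalled = 0
            else:
                self.stalled += 1
            if self.stalled >= self.patience:
                # The smoothed loss no longer decreases: freeze the peak LR and start decay.
                self.phase = "decay"
                self.peak_lr = self.lr
                self.decay_start = self.step_count
            else:
                self.lr *= self.growth
            return self.lr
        # Decay phase: cosine curve from the peak LR to zero at total_steps (assumed shape).
        remaining = self.total_steps - self.decay_start
        if remaining > 0:
            progress = min(1.0, (self.step_count - self.decay_start) / remaining)
        else:
            progress = 1.0
        self.lr = self.peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
        return self.lr
```

In a training loop, one would call `schedule.step(loss)` once per update and
pass the returned value to the optimizer. The peak LR is found during warmup
rather than set by hand, which matches the description above.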

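This paper proposes an effective LR scheduling algorithm consisting of adaptive warmup and predefined decay. An online check based on Gaussian process smoothing lets it train neural networks effectively with large batch sizes.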

Automated Learning Rate Scheduler for Large-batch Training

Large-batch training has become a commonly used technique when training
neural networks with a large number of GPU/TPU processors. As batch size
increases, stochastic optimizers tend to converge to sharp local minima,
leading to degraded test performance. Current methods usually use extensive
data augmentation to increase the batch size, but we found that the performance
gain from data augmentation decreases as batch size increases, and data
augmentation becomes insufficient after a certain point. In this paper, we propose to use
adversarial learning to increase the batch size in large-batch training.
Despite being a natural choice for smoothing the decision surface and biasing
towards a flat region, adversarial learning has not been successfully applied
in large-batch training since it requires at least two sequential gradient
computations at each step, which will at least double the running time compared
with vanilla training even with a large number of processors. To overcome this
issue, we propose a novel Concurrent Adversarial Learning (ConAdv) method that
decouples the sequential gradient computations in adversarial learning by
utilizing stale parameters. Experimental results demonstrate that ConAdv can
successfully increase the batch size on ResNet-50 training on ImageNet while
maintaining high accuracy. In particular, we show ConAdv alone can achieve
75.3\% top-1 accuracy on ImageNet ResNet-50 training with 96K batch size, and
the accuracy can be further improved to 76.2\% when combining ConAdv with data
augmentation. This is the first work to successfully scale the ResNet-50
training batch size to 96K.
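
The idea of computing the adversarial perturbation from stale parameters can be
illustrated on a toy problem. The sketch below is not the ConAdv implementation:
it fits a one-parameter linear model with squared loss, uses a sign-gradient
perturbation as an assumed attack, and runs sequentially. The function names,
the `staleness` parameter, and all hyperparameters are hypothetical.

```python
from collections import deque


def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def train_with_stale_adversary(data, w0=0.0, lr=0.05, eps=0.1, staleness=1, epochs=50):
    """Fit y ~ w * x with squared loss on adversarially perturbed inputs.

    The perturbation for each step is computed from weights that are
    `staleness` updates old, so it does not depend on the current weights.
    """
    w = w0
    # history[0] is the weight from `staleness` updates ago; history[-1] is current.
    history = deque([w0] * (staleness + 1), maxlen=staleness + 1)
    for _ in range(epochs):
        for x, y in data:
            w_stale = history[0]
            # Gradient of (w_stale * x - y)^2 with respect to the input x.
            grad_x = 2.0 * (w_stale * x - y) * w_stale
            # Sign-gradient perturbation that increases the stale model's loss.
            x_adv = x + eps * sign(grad_x)
            # Gradient of (w * x_adv - y)^2 with respect to the current weight w.
            grad_w = 2.0 * (w * x_adv - y) * x_adv
            w -= lr * grad_w
            history.append(w)
    return w


# Toy data generated by y = 2 * x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w_fit = train_with_stale_adversary(data)
```

Because `x_adv` depends only on `w_stale`, a parallel implementation could
prepare it while earlier weight updates are still in flight, which is the
decoupling the abstract refers to. Setting `staleness=0` recovers the usual
sequential dependence on the current weights.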

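This study proposes using adversarial learning to increase the batch size in large-batch training, addressing the diminishing benefit of data augmentation as batch size grows. The Concurrent Adversarial Learning (ConAdv) method addresses the time cost of adversarial learning, successfully scaling the batch size to 96K for ResNet-50 training on ImageNet while maintaining high accuracy and substantially improving training efficiency.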