Normalization techniques are a boon for modern deep learning. They let
weights converge more quickly, often with better generalization performance. It
has been argued that the normalization-induced scale invariance among the
weights provides an advantageous ground for gradient descent (GD) optimizers:
the effective step sizes are automatically reduced over time, stabilizing the
overall training procedure. It is often overlooked, however, that the
additional introduction of momentum in GD optimizers results in a far more
rapid reduction in effective step sizes for scale-invariant weights, a
phenomenon that has not yet been studied and may have caused unwanted side
effects in current practice. This is a crucial issue because arguably the
vast majority of modern deep neural networks combine (1) momentum-based GD
(e.g. SGD or Adam) and (2) scale-invariant parameters. In this paper, we verify
that the widely adopted combination of the two ingredients leads to the
premature decay of effective step sizes and sub-optimal model performance. We
propose a simple and effective remedy, SGDP and AdamP: get rid of the radial
component, or the norm-increasing direction, at each optimizer step. Because of
the scale invariance, this modification only alters the effective step sizes
without changing the effective update directions, thus enjoying the original
convergence properties of GD optimizers. Given the ubiquity of momentum GD and
scale invariance in machine learning, we have evaluated our methods against the
baselines on 13 benchmarks. They range from vision tasks like classification
(e.g. ImageNet), retrieval (e.g. CUB and SOP), and detection (e.g. COCO) to
language modelling (e.g. WikiText) and audio classification (e.g. DCASE) tasks.
We verify that our solution brings about uniform gains in those benchmarks.
Source code is publicly available.
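
The remedy described above, removing the radial (norm-increasing) component of each optimizer step, can be illustrated with a small NumPy sketch. This is not the authors' released implementation: the function names `remove_radial_component` and `sgd_momentum_step`, the `eps` constant, and the plain momentum update are assumptions made for illustration, and any further details of SGDP and AdamP that the abstract does not state are omitted.

```python
import numpy as np


def remove_radial_component(weight, update, eps=1e-8):
    """Return `update` with its component along `weight` removed.

    Hypothetical helper: the result is orthogonal to `weight`, so it
    carries no direct push along the weight vector itself.
    """
    w = weight.ravel()
    u = update.ravel()
    radial = (w @ u) / (w @ w + eps) * w  # projection of u onto w
    return (u - radial).reshape(update.shape)


def sgd_momentum_step(weight, grad, buf, lr=0.1, momentum=0.9, project=True):
    """One momentum-SGD step, optionally with the radial component removed.

    Assumed update rule (illustrative only): buf <- momentum * buf + grad,
    then weight <- weight - lr * step.
    """
    buf = momentum * buf + grad
    step = remove_radial_component(weight, buf) if project else buf
    return weight - lr * step, buf


if __name__ == "__main__":
    w = np.array([3.0, 4.0])
    u = np.array([1.0, 2.0])
    p = remove_radial_component(w, u)
    print("projected update:", p)
    print("inner product with weight:", float(w @ p))
```

Because the projected step is orthogonal to the current weight, the squared weight norm grows by only `lr**2 * ||step||**2` per step, while the tangential part of the update is left unchanged.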

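This paper discusses the importance of normalization techniques in deep learning and the problem that arises when they are combined with momentum-based gradient descent optimizers. The authors propose two remedies, SGDP and AdamP, which remove the radial component (the norm-increasing direction) at each optimizer step to preserve model performance, and they evaluate the methods on 13 benchmarks.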

AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights

Learning to learn has emerged as an important direction for achieving
artificial intelligence. Two of the primary barriers to its adoption are an
inability to scale to larger problems and a limited ability to generalize to
new tasks. We introduce a learned gradient descent optimizer that generalizes
well to new tasks, and which has significantly reduced memory and computation
overhead. We achieve this by introducing a novel hierarchical RNN architecture,
with minimal per-parameter overhead, augmented with additional architectural
features that mirror the known structure of optimization tasks. We also develop
a meta-training ensemble of small, diverse optimization tasks capturing common
properties of loss landscapes. The optimizer learns to outperform RMSProp/ADAM
on problems in this corpus. More importantly, it performs comparably or better
when applied to small convolutional neural networks, despite seeing no neural
networks in its meta-training set. Finally, it generalizes to train Inception
V3 and ResNet V2 architectures on the ImageNet dataset for thousands of steps,
optimization problems that are of a vastly different scale than those it was
trained on. We release an open source implementation of the meta-training
algorithm.

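By introducing a hierarchical RNN architecture and a meta-training ensemble of small tasks, the authors build a learned gradient descent optimizer that addresses limited scalability to larger problems and limited generalization to new tasks, and that trains Inception V3 and ResNet V2 architectures on the ImageNet dataset for thousands of steps.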