Continual learning, an important aspect of artificial intelligence and
machine learning research, focuses on developing models that learn and adapt to
new tasks while retaining previously acquired knowledge. Existing continual
learning algorithms are usually evaluated on a small number of tasks of uniform
size, a setting that may not accurately represent real-world learning
scenarios. In this paper,
we investigate the performance of continual learning algorithms with a large
number of tasks drawn from a task distribution that is long-tail in terms of
task sizes. We design one synthetic dataset and two real-world continual
learning datasets to evaluate the performance of existing algorithms in such a
setting. Moreover, we study an overlooked factor in continual learning, the
optimizer states, e.g., the first and second moments in the Adam optimizer, and
investigate how they can be used to improve continual learning performance. We
propose a method that reuses the optimizer states in Adam by maintaining a
weighted average of the second moments from previous tasks. We demonstrate that
our method, compatible with most existing continual learning algorithms,
effectively reduces forgetting at only a small additional computational or
memory cost, and further improves existing continual learning algorithms,
particularly on long-tail task sequences.
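
The idea of carrying Adam's second moments across tasks can be sketched as
follows. This is a minimal illustration under assumptions the abstract does not
state: each finished task is weighted by a caller-supplied weight (for example
its number of training steps), and the stored average enters the Adam
denominator through an elementwise maximum. The names SecondMomentBank and
adam_step_with_prior are hypothetical, and the paper may combine the moments
differently.

```python
import math


class SecondMomentBank:
    """Running weighted average of Adam second moments from finished tasks."""

    def __init__(self, num_params):
        self.avg = [0.0] * num_params  # weighted average of past second moments
        self.total_weight = 0.0        # sum of the task weights seen so far

    def add_task(self, v, weight):
        # Incremental weighted mean: avg <- avg + (w / W_new) * (v - avg)
        self.total_weight += weight
        ratio = weight / self.total_weight
        self.avg = [a + ratio * (x - a) for a, x in zip(self.avg, v)]


def adam_step_with_prior(params, grads, m, v, t, prior_v,
                         lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step whose denominator also sees second moments of past tasks.

    Assumption: the prior is combined with the bias-corrected current second
    moment by an elementwise max, so parameters that received large gradients
    on earlier tasks take smaller steps on the current one.
    """
    new_params, new_m, new_v = [], [], []
    for p, g, mi, vi, pv in zip(params, grads, m, v, prior_v):
        mi = b1 * mi + (1 - b1) * g          # first moment
        vi = b2 * vi + (1 - b2) * g * g      # second moment
        m_hat = mi / (1 - b1 ** t)           # bias correction, t starts at 1
        v_hat = vi / (1 - b2 ** t)
        denom = math.sqrt(max(v_hat, pv)) + eps
        new_params.append(p - lr * m_hat / denom)
        new_m.append(mi)
        new_v.append(vi)
    return new_params, new_m, new_v


bank = SecondMomentBank(2)
bank.add_task([4.0, 0.0], weight=3)  # a large task
bank.add_task([0.0, 4.0], weight=1)  # a small task
# bank.avg is now [3.0, 1.0]: the large task dominates the average
```

The extra memory is one vector the size of the second moment, which is
consistent with the small overhead the abstract claims.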

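This paper studies the performance of continual learning algorithms on long-tail task sequences with a large number of tasks, and examines optimizer states as a factor for improving continual learning performance. By maintaining a weighted average of the second moments from previous tasks, the proposed method effectively reduces forgetting and improves on existing continual learning algorithms.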

Continual Learning of Numerous Tasks from Long-tail Distributions

Optimizer states are a major source of memory consumption when training neural
networks, limiting the largest model trainable within a given memory budget.
Compressing the optimizer states from 32-bit floating point to a lower bitwidth
is a promising way to reduce the training memory footprint, but the lowest
bitwidth achieved so far is 8 bits. In this work, we push the optimizer state
bitwidth down to 4 bits through a detailed empirical analysis of the first and
second moments. Specifically, we find that the moments have complicated outlier
patterns that current block-wise quantization cannot accurately approximate.
We use a smaller block size and propose to utilize both row-wise and
column-wise information for better quantization. We further identify a
zero-point problem in quantizing the second moment, and solve it with a linear
quantizer that excludes the zero point. Our 4-bit optimizer is
evaluated on a wide variety of benchmarks including natural language
understanding, machine translation, image classification, and instruction
tuning. On all tasks, our optimizers achieve accuracy comparable to their
full-precision counterparts, with better memory efficiency.
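
The zero-point problem can be illustrated with a small block-wise quantizer.
The sketch below is not the paper's implementation: the two 16-level codebooks
(LINEAR_WITH_ZERO and LINEAR_NO_ZERO), the per-block max scaling, and the
function names are assumptions chosen for illustration, and the row-wise and
column-wise normalization is not covered. Adam divides by the square root of the
second moment, so any entry that dequantizes to exactly zero makes the
corresponding update blow up.

```python
# 4 bits give 16 codes. Levels are fractions of the per-block maximum.
LINEAR_WITH_ZERO = [i / 15 for i in range(16)]      # 0, 1/15, ..., 1
LINEAR_NO_ZERO = [(i + 1) / 16 for i in range(16)]  # 1/16, 2/16, ..., 1


def quantize(values, block_size, levels):
    """Quantize each block to the nearest level after scaling by its max."""
    blocks = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        scale = max(abs(x) for x in block) or 1.0  # avoid dividing by zero
        codes = [
            min(range(len(levels)), key=lambda i: abs(x / scale - levels[i]))
            for x in block
        ]
        blocks.append((codes, scale))
    return blocks


def dequantize(blocks, levels):
    """Map codes back to approximate values."""
    return [levels[c] * scale for codes, scale in blocks for c in codes]


# A second-moment vector with an outlier in the first block of four entries.
v = [1e-6, 1e-4, 0.5, 1.0, 2e-8, 3e-8, 1e-8, 4e-8]

with_zero = dequantize(quantize(v, 4, LINEAR_WITH_ZERO), LINEAR_WITH_ZERO)
no_zero = dequantize(quantize(v, 4, LINEAR_NO_ZERO), LINEAR_NO_ZERO)
# with_zero contains exact zeros for the small entries next to the outlier;
# no_zero keeps every entry strictly positive.
```

The second block shows why a smaller block size helps: its entries are tiny in
absolute terms but are scaled by their own maximum, so they are not flattened
by the outlier in the first block.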

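Through a detailed empirical analysis, this work reduces the optimizer state bitwidth to 4 bits. Better quantization methods address the outlier problem in the moments and the zero-point problem of the second moment, achieving accuracy comparable to full-precision counterparts on natural language understanding, machine translation, image classification, and instruction tuning, while improving memory efficiency.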

Memory Efficient Optimizers with 4-bit States

Running out of GPU memory has become a main bottleneck for large-scale DNN
training, and reducing the memory footprint during training has received
intensive research attention. We find that existing gradient accumulation
reduces activation memory but is incompatible with gradient memory reduction,
because accumulation requires preserving gradients while gradient memory
reduction requires releasing them. To address this issue, we propose a novel
optimizer accumulation
method for Adam, named Adam Accumulation (AdamA), which enables reducing both
activation and gradient memory. Specifically, AdamA directly integrates
gradients into optimizer states and accumulates optimizer states over
micro-batches, so that gradients can be released immediately after use. We
mathematically and experimentally demonstrate that AdamA yields the same
convergence properties as Adam. Evaluated on transformer-based models, AdamA
achieves up to 23% memory reduction compared to gradient accumulation with less
than 2% degradation in training throughput. Notably, AdamA can work together
with memory reduction methods for optimizer states to fit 1.26x to 3.14x larger
models than the PyTorch and DeepSpeed baselines on GPUs with different memory
capacities.
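
The contrast between gradient accumulation and accumulating directly into the
optimizer states can be sketched for a single scalar parameter. This is a
hedged illustration, not the authors' code: the function names are
hypothetical, and the way the second moment is accumulated (a sum of squared
scaled micro-batch gradients) is an assumption about how the states are
integrated. The first moment is linear in the gradient, so folding in
micro-batch gradients one at a time gives the same value as averaging them
first.

```python
def states_with_grad_accumulation(micro_grads, m, v, b1=0.9, b2=0.999):
    """Baseline: average all micro-batch gradients, then update m and v once.

    The accumulated gradient must stay in memory until the last micro-batch.
    """
    n = len(micro_grads)
    g = sum(micro_grads) / n
    return b1 * m + (1 - b1) * g, b2 * v + (1 - b2) * g * g


def states_with_optimizer_accumulation(micro_grads, m, v, b1=0.9, b2=0.999):
    """Fold each micro-batch gradient into m and v as soon as it is computed."""
    n = len(micro_grads)
    m, v = b1 * m, b2 * v  # decay the states once per optimizer step
    for g in micro_grads:
        scaled = g / n
        m += (1 - b1) * scaled
        # Assumption: squares are summed per micro-batch, which differs from
        # squaring the averaged gradient as the baseline does.
        v += (1 - b2) * scaled * scaled
        # At this point g is no longer needed and its memory can be released.
    return m, v


grads = [0.2, 0.4, 0.6]
m_base, v_base = states_with_grad_accumulation(grads, m=0.0, v=0.0)
m_fold, v_fold = states_with_optimizer_accumulation(grads, m=0.0, v=0.0)
# m_base and m_fold agree up to floating-point rounding; v differs by design.
```

Under this reading, the memory saving comes from never holding a
full accumulated gradient buffer across micro-batches.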

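This work studies the GPU memory problem in large-scale DNN training and proposes AdamA, an optimizer accumulation method that reduces both activation and gradient memory, performs no worse than Adam, and can be used with frameworks such as PyTorch and DeepSpeed.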