Large language models have achieved remarkable success, but their large
parameter counts demand substantial memory for training, which sets a high
hardware threshold. While the recently proposed low-memory optimization (LOMO)
reduces memory footprint, its optimization technique, akin to stochastic
gradient descent, is sensitive to hyper-parameters and exhibits suboptimal
convergence, failing to match the performance of the prevailing optimizer for
large language models, AdamW. Through empirical analysis of the Adam optimizer,
we found that, compared to momentum, the adaptive learning rate is more
critical for bridging the gap. Building on this insight, we introduce the
low-memory optimization with adaptive learning rate (AdaLomo), which offers an
adaptive learning rate for each parameter. To maintain memory efficiency, we
employ non-negative matrix factorization for the second-order moment estimation
in the optimizer state. Additionally, we suggest the use of a grouped update
normalization to stabilize convergence. Our experiments with instruction-tuning
and further pre-training demonstrate that AdaLomo achieves results on par with
AdamW, while significantly reducing memory requirements, thereby lowering the
hardware barrier to training large language models.
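
The abstract names two mechanisms without giving formulas: a factorized second-moment estimate and a grouped update normalization. The sketch below is one plausible reading, not the authors' implementation. It assumes an Adafactor-style rank-1 factorization (running row sums and column sums of the squared gradient) and an RMS-based clip applied to each parameter matrix separately. All function names, `beta2`, `eps`, and `clip_threshold` are hypothetical, and bias correction and the learning rate are omitted.

```python
import math


def rms(matrix):
    """Root mean square over all entries of a 2-D list."""
    flat = [x for row in matrix for x in row]
    return math.sqrt(sum(x * x for x in flat) / len(flat))


def update_factored_state(grad, row_state, col_state, beta2=0.999, eps=1e-30):
    """Update running row/column sums of grad**2 in place.

    For an m x n parameter this stores m + n numbers instead of m * n.
    """
    m, n = len(grad), len(grad[0])
    sq = [[g * g + eps for g in row] for row in grad]
    for i in range(m):
        row_state[i] = beta2 * row_state[i] + (1.0 - beta2) * sum(sq[i])
    for j in range(n):
        col_sum = sum(sq[i][j] for i in range(m))
        col_state[j] = beta2 * col_state[j] + (1.0 - beta2) * col_sum


def adaptive_update(grad, row_state, col_state):
    """Scale each gradient entry by 1 / sqrt(v_ij), where
    v_ij ~= row_state[i] * col_state[j] / sum(row_state) (assumed rank-1 form)."""
    total = sum(row_state)
    return [
        [g / math.sqrt(row_state[i] * col_state[j] / total) for j, g in enumerate(row)]
        for i, row in enumerate(grad)
    ]


def normalize_group(update, clip_threshold=1.0):
    """Assumed grouped normalization: shrink one parameter's update so that
    its RMS does not exceed clip_threshold; smaller updates pass unchanged."""
    scale = max(1.0, rms(update) / clip_threshold)
    return [[u / scale for u in row] for row in update]


# Toy usage on a single 2 x 3 parameter matrix.
grad = [[0.1, -0.2, 0.3], [0.4, 0.5, -0.6]]
row_state, col_state = [0.0] * 2, [0.0] * 3
update_factored_state(grad, row_state, col_state)
update = normalize_group(adaptive_update(grad, row_state, col_state))
print(update)
```

Under these assumptions the optimizer state grows with the sum of a matrix's dimensions rather than their product, which is where the memory saving over a full second-moment buffer would come from. Normalizing per parameter rather than globally also needs only one parameter's update at a time.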

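Large language models typically require substantial memory to train. AdaLomo extends low-memory optimization (LOMO) with a per-parameter adaptive learning rate and a matrix-factorized second-moment estimate, reducing memory requirements while performing on par with the AdamW optimizer on large language models.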

AdaLomo: Low-memory Optimization with Adaptive Learning Rate

Large Language Models (LLMs) have revolutionized Natural Language Processing
(NLP) but demand massive GPU resources for training. Lowering the threshold for
LLM training would encourage greater participation from researchers,
benefiting both academia and society. While existing approaches have focused on
parameter-efficient fine-tuning, which tunes or adds a small number of
parameters, few have addressed the challenge of tuning the full parameters of
LLMs with limited resources. In this work, we propose a new optimizer,
LOw-Memory Optimization (LOMO), which fuses the gradient computation and the
parameter update in one step to reduce memory usage. By integrating LOMO with
existing memory-saving techniques, we reduce memory usage to 10.8% of that of
the standard approach (DeepSpeed solution). Consequently, our approach enables
full-parameter fine-tuning of a 65B model on a single machine with 8 RTX 3090
GPUs, each with 24 GB of memory.
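
To illustrate what fusing the gradient computation and the parameter update can look like, here is a minimal sketch on a toy chain of scalar linear layers. It is an illustration under stated assumptions, not the LOMO implementation: the function names are hypothetical, the update rule is assumed to be plain SGD, and the model is deliberately trivial. The point is that each weight is updated as soon as its gradient is available during the backward pass, so only one parameter gradient is alive at any time.

```python
def forward(weights, x):
    """Run h <- w * h through every layer, saving each layer's input."""
    inputs, h = [], x
    for w in weights:
        inputs.append(h)
        h = w * h
    return h, inputs


def standard_step(weights, x, target, lr):
    """Conventional training step: store all gradients, then update."""
    out, inputs = forward(weights, x)
    upstream = out - target  # d(0.5 * (out - target)**2) / d(out)
    grads = [0.0] * len(weights)  # every gradient is held until the end
    for k in reversed(range(len(weights))):
        grads[k] = upstream * inputs[k]
        upstream = upstream * weights[k]
    return [w - lr * g for w, g in zip(weights, grads)]


def fused_step(weights, x, target, lr):
    """Fused step: update each weight in place during the backward pass."""
    out, inputs = forward(weights, x)
    upstream = out - target
    for k in reversed(range(len(weights))):
        w_old = weights[k]
        grad_w = upstream * inputs[k]  # the only parameter gradient alive
        upstream = upstream * w_old  # propagate using the pre-update weight
        weights[k] = w_old - lr * grad_w  # apply the update immediately
    return weights


# Toy usage: both variants produce the same weights after one step.
init = [0.5, -1.5, 2.0]
print(standard_step(list(init), x=1.0, target=0.3, lr=0.01))
print(fused_step(list(init), x=1.0, target=0.3, lr=0.01))
```

In this sketch the fused variant matches the conventional step because the backward signal is propagated with the pre-update weight before that weight is overwritten. The saving comes from never holding a gradient buffer for every parameter at once.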

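The paper proposes a new optimizer, LOw-Memory Optimization (LOMO), which fuses gradient computation and the parameter update into one step. Combined with existing memory-saving techniques, it lowers memory usage during full-parameter fine-tuning of large language models (LLMs) and enables full-parameter fine-tuning of a 65B-parameter model on a single machine with 8 RTX 3090 GPUs.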