Fine-tuning pre-trained large language models (LLMs) with limited hardware
presents challenges due to GPU memory constraints. Various distributed
fine-tuning methods have been proposed to alleviate these constraints.
However, it remains unclear which method achieves the fastest fine-tuning
while avoiding GPU out-of-memory errors in a given environment.
To address this challenge, we introduce LLMem, a solution that
estimates the GPU memory consumption when applying distributed fine-tuning
methods across multiple GPUs and identifies the optimal method. We conduct GPU
memory usage estimation prior to fine-tuning, leveraging the fundamental
structure of transformer-based decoder models and the memory usage distribution
of each method. Experimental results show that LLMem accurately estimates peak
GPU memory usage on a single GPU, with error rates of up to 1.6%. Additionally,
it shows an average error rate of 3.0% when applying distributed fine-tuning
methods to LLMs with more than a billion parameters on multi-GPU setups.
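
To make the idea of estimating memory before fine-tuning concrete, here is a minimal back-of-the-envelope sketch. It is not LLMem's estimator, which the abstract does not spell out; it only counts the static tensors (weights, gradients, and two Adam state tensors per parameter, all assumed to be 32-bit floats) and ignores activations, temporary buffers, and framework overhead. The function name `estimate_static_memory_gib` and its parameters are hypothetical.

```python
GIB = 2**30  # bytes per GiB


def estimate_static_memory_gib(num_params, num_gpus=1, shard_states=False,
                               param_bytes=4, grad_bytes=4, optim_bytes=8):
    """Rough per-GPU memory (GiB) for weights, gradients, and optimizer state.

    Assumptions (not taken from the LLMem paper): fp32 weights and gradients
    (4 bytes each) and Adam-style optimizer state (two fp32 tensors, 8 bytes
    per parameter). Activations and temporary buffers are NOT included.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    total_bytes = num_params * (param_bytes + grad_bytes + optim_bytes)
    if shard_states:
        # Fully sharded methods split these tensors evenly across GPUs;
        # plain data parallelism keeps a full copy on every GPU.
        total_bytes /= num_gpus
    return total_bytes / GIB


if __name__ == "__main__":
    one_billion = 1_000_000_000
    replicated = estimate_static_memory_gib(one_billion, num_gpus=4)
    sharded = estimate_static_memory_gib(one_billion, num_gpus=4, shard_states=True)
    print(f"replicated: {replicated:.2f} GiB per GPU")
    print(f"sharded:    {sharded:.2f} GiB per GPU")
```

Even this crude count shows why the choice of distributed method matters: replicating a billion-parameter model leaves far less room for activations on each GPU than sharding it does. A real estimator also has to model the activation and buffer usage that this sketch omits.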

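LLMem is a solution for fine-tuning large language models on limited hardware: it estimates the GPU memory consumption of distributed fine-tuning methods across multiple GPUs and identifies the optimal method, addressing both GPU memory limits and the need for fast fine-tuning.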

LLMem: Estimating GPU Memory Usage for Fine-Tuning Pre-Trained LLMs

With the rapid adoption of machine learning (ML), a number of domains now
fine-tune models that were pre-trained on a large corpus of data. However, our
experiments show that fine-tuning models like BERT can take many hours, even
with modern accelerators like GPUs. While prior work proposes limiting the
number of layers that are fine-tuned, e.g., freezing all layers but the last
layer, we find that such static approaches lead to reduced accuracy. We
propose AutoFreeze, a system that uses an adaptive
approach to choose which layers are trained and show how this can accelerate
model fine-tuning while preserving accuracy. We also develop mechanisms to
enable efficient caching of intermediate activations which can reduce the
forward computation time when performing fine-tuning. We extend AutoFreeze to
perform distributed fine-tuning and design two execution modes that minimize
cost and running time respectively. Our evaluation on ten NLP tasks shows that
AutoFreeze, with caching enabled, can speed up fine-tuning on a single GPU by up
to 2.55x. On a 64-GPU cluster, for fine-tuning on the AG's news dataset,
AutoFreeze is able to achieve up to 4.38x speedup when optimizing for
end-to-end training time and 5.03x reduction in total cost when optimizing for
efficiency, without affecting model accuracy.
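
The caching idea can be illustrated with a toy sketch. Once the leading layers are frozen, their output for a given input no longer changes (assuming no dropout or data augmentation in those layers), so it can be stored and reused on later epochs. This is only an illustration under that assumption, not AutoFreeze's implementation; the class `FrozenPrefixCache`, its methods, and the scalar "layers" are hypothetical stand-ins for real network layers and tensors.

```python
from typing import Callable, Dict, List

Layer = Callable[[float], float]


class FrozenPrefixCache:
    """Caches the output of the frozen leading layers, keyed by example id."""

    def __init__(self, layers: List[Layer], num_frozen: int) -> None:
        self.layers = layers
        self.num_frozen = num_frozen
        self.cache: Dict[int, float] = {}
        self.prefix_evaluations = 0  # how often the frozen prefix actually ran

    def set_num_frozen(self, num_frozen: int) -> None:
        # Changing the frozen boundary invalidates every cached activation.
        self.num_frozen = num_frozen
        self.cache.clear()

    def forward(self, example_id: int, x: float) -> float:
        if example_id not in self.cache:
            h = x
            for layer in self.layers[: self.num_frozen]:
                h = layer(h)
            self.cache[example_id] = h
            self.prefix_evaluations += 1
        h = self.cache[example_id]
        # Only the trainable suffix is recomputed on every call.
        for layer in self.layers[self.num_frozen:]:
            h = layer(h)
        return h


if __name__ == "__main__":
    toy_layers = [lambda v: v + 1.0, lambda v: v * 2.0, lambda v: v - 3.0]
    model = FrozenPrefixCache(toy_layers, num_frozen=2)
    for epoch in range(3):
        model.forward(example_id=0, x=1.0)
    print("frozen prefix evaluated", model.prefix_evaluations, "time(s)")
```

The saving grows with the number of frozen layers, at the cost of storing one activation per training example. Which layers to freeze, and when, is the adaptive decision the abstract attributes to AutoFreeze and is not modeled here.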

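This work proposes AutoFreeze, a system that adaptively selects which layers to train and provides two execution modes, accelerating fine-tuning while preserving model accuracy. With caching, it speeds up fine-tuning on a single GPU by up to 2.55x; on a 64-GPU cluster it achieves up to a 4.38x speedup and a 5.03x reduction in total cost.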