Pipeline parallelism has been widely explored, but most existing schedules
lack a systematic methodology. In this paper, we propose a framework that
decomposes pipeline schedules into repetitions of a building block, and we show
that the lifespan of the building block determines the peak activation memory of
the pipeline schedule. Guided by this observation, we find that almost all existing
pipeline schedules, to the best of our knowledge, are memory inefficient. To
address this, we introduce a family of memory efficient building blocks with
controllable activation memory, which can reduce the peak activation memory to
1/2 of 1F1B without sacrificing efficiency, and even to 1/3 with comparable
throughput. We can also achieve almost zero pipeline bubbles while maintaining
the same activation memory as 1F1B. Our evaluations demonstrate that in pure
pipeline parallelism settings, our methods outperform 1F1B by 7% to 55% in
terms of throughput. When employing a grid search over hybrid parallelism
hyperparameters in practical scenarios, our proposed methods demonstrate a 16%
throughput improvement over the 1F1B baseline for large language models.
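
For reference, the memory behavior of the 1F1B baseline can be sketched by counting how many microbatches each stage has run forward but not yet backward. The snippet below is a minimal illustration, assuming the standard 1F1B order (a warmup of forward passes, then alternating forward and backward passes, then a cooldown of backward passes). The function names and sizes are hypothetical, and it does not implement the building-block framework proposed in the paper.

```python
def one_f_one_b_ops(num_stages, stage, num_microbatches):
    """Op order ('F' or 'B') for one stage under the standard 1F1B schedule."""
    # Earlier stages run more warmup forwards before their first backward.
    warmup = min(num_stages - stage - 1, num_microbatches)
    steady = num_microbatches - warmup
    return ["F"] * warmup + ["F", "B"] * steady + ["B"] * warmup


def peak_live_activations(ops):
    """Largest number of microbatches whose activations are held at once."""
    live = peak = 0
    for op in ops:
        live += 1 if op == "F" else -1  # F stores activations, B frees them
        peak = max(peak, live)
    return peak


# Illustrative sizes (assumptions): 4 pipeline stages, 8 microbatches.
peaks = [peak_live_activations(one_f_one_b_ops(4, s, 8)) for s in range(4)]
print(peaks)  # [4, 3, 2, 1]
```

The first stage holds the most in-flight microbatches, which is why peak activation memory under 1F1B grows with the number of pipeline stages.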

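By decomposing pipeline schedules into a repeating building block, the paper shows that the block's lifespan determines peak activation memory and finds that existing schedules are memory inefficient. To address this, it introduces a family of memory-efficient building blocks with controllable activation memory that reduce peak activation memory to 1/2 of 1F1B without sacrificing efficiency, or to 1/3 with comparable throughput; they can also achieve almost zero pipeline bubbles at the same activation memory as 1F1B. With a grid search over hybrid parallelism hyperparameters in practical scenarios, the proposed methods deliver a 16% throughput improvement over the 1F1B baseline for large language models.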

Pipeline Parallelism with Controllable Memory

Large deep learning models have achieved impressive performance across a
range of applications. However, their large memory requirements, including
parameter memory and activation memory, have become a significant challenge for
their practical serving. While existing methods mainly address parameter
memory, the importance of activation memory has been overlooked. Especially for
long input sequences, activation memory is expected to grow rapidly as the
sequence length increases. In this paper, we
propose AutoChunk, an automatic and adaptive compiler system that efficiently
reduces activation memory for long sequence inference through chunk strategies. The
proposed system generates chunk plans by optimizing through multiple stages. In
each stage, the chunk search pass explores all possible chunk candidates and
the chunk selection pass identifies the optimal one. At runtime, AutoChunk
employs code generation to automatically apply chunk strategies. The
experiments demonstrate that AutoChunk can reduce activation memory by over 80%
while keeping speed loss within 10%, extend the maximum sequence length by
3.2x to 11.7x, and outperform state-of-the-art methods by a large margin.
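
The idea behind a chunk strategy can be sketched by hand for a single attention computation: processing the queries a few rows at a time means only a slice of the score matrix exists at any moment. This is a minimal sketch, assuming chunking along the query dimension; the function names are hypothetical and this is not AutoChunk's API. AutoChunk searches for and generates such plans automatically.

```python
import math


def attention_rows(queries, keys, values):
    """Attention outputs plus the number of score entries materialized."""
    scores = [[sum(a * b for a, b in zip(q, k)) for k in keys] for q in queries]
    outputs = []
    for row in scores:
        top = max(row)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in row]
        total = sum(weights)
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))
        ])
    return outputs, len(scores) * len(keys)


def chunked_attention(queries, keys, values, chunk_size):
    """Same result, but only chunk_size rows of scores are live at a time."""
    outputs, peak = [], 0
    for start in range(0, len(queries), chunk_size):
        out, live = attention_rows(queries[start:start + chunk_size], keys, values)
        outputs.extend(out)
        peak = max(peak, live)
    return outputs, peak


# Illustrative data (assumptions): sequence length 8, feature size 2.
queries = [[0.1 * i, 1.0] for i in range(8)]
keys = [[0.2 * j, -0.5] for j in range(8)]
values = [[float(j), 1.0] for j in range(8)]

full_out, full_live = attention_rows(queries, keys, values)
chunk_out, chunk_peak = chunked_attention(queries, keys, values, chunk_size=2)
print(full_live, chunk_peak)  # 64 16
```

The outputs are identical, while the peak intermediate size drops from 64 to 16 score entries. The price is a loop over chunks, which is the kind of speed cost a chunk plan has to keep small.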

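AutoChunk is an automatic and adaptive compiler system that uses chunk strategies to reduce activation memory in long sequence inference. Experiments show that it reduces activation memory by over 80% while keeping speed loss within 10%, extends the maximum sequence length by 3.2x to 11.7x, and outperforms existing methods by a large margin.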

AutoChunk: Automated Activation Chunk for Memory-Efficient Long Sequence Inference

The low-rank adaptation (LoRA) method can largely reduce the number of
trainable parameters for fine-tuning large language models (LLMs); however, it
still requires expensive activation memory to update low-rank weights. Reducing
the number of LoRA layers or using activation recomputation could harm the
fine-tuning performance or increase the computational overhead. In this work,
we present LoRA-FA, a memory-efficient fine-tuning method that reduces the
activation memory without performance degradation or expensive recomputation.
LoRA-FA freezes the projection-down weight $A$ and updates the
projection-up weight $B$ in each LoRA layer. This ensures that the change in model
weights resides in a low-rank space during LLM fine-tuning, while eliminating
the requirement to store full-rank input activations. We conduct extensive
experiments across multiple model types (RoBERTa, T5, LLaMA) and model scales.
Our results show that LoRA-FA consistently achieves fine-tuning accuracy close
to that of full-parameter fine-tuning and LoRA across different tasks.
Furthermore, LoRA-FA can reduce the overall memory cost by up to 1.4$\times$
compared to LoRA.
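
Why freezing $A$ removes the need to store the full-rank input can be seen from the chain rule: for a layer $y = Wx + B(Ax)$, the gradient with respect to $B$ depends only on the rank-$r$ projection $z = Ax$, not on $x$ itself. The following is a minimal sketch of that observation in plain Python; the function names, matrix values, and toy loss are assumptions for illustration and not the authors' implementation.

```python
def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_fa_forward(W, A, B, x):
    """y = W x + B (A x); returns y and the only activation kept for backward."""
    z = matvec(A, x)  # rank-r projection of the input
    y = [base + delta for base, delta in zip(matvec(W, x), matvec(B, z))]
    return y, z


def grad_B(grad_y, z):
    """dL/dB = grad_y z^T, which needs z (size r) but not x (size d_in)."""
    return [[g * z_j for z_j in z] for g in grad_y]


# Illustrative sizes (assumptions): d_in = 4, rank r = 2, d_out = 3.
W = [[0.1, 0.2, 0.3, 0.4], [0.0, -0.1, 0.2, 0.1], [0.5, 0.0, -0.3, 0.2]]
A = [[1.0, 0.0, 1.0, 0.5], [0.0, 2.0, 0.0, -0.5]]  # frozen, never updated
B = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]           # trainable
x = [1.0, 2.0, 3.0, 4.0]

y, z = lora_fa_forward(W, A, B, x)
dB = grad_B([1.0, 1.0, 1.0], z)  # gradient of the toy loss L = sum(y)
print(len(z), len(x))  # 2 4
```

If $A$ were also trainable, as in standard LoRA, its gradient $B^\top (\partial L / \partial y)\, x^\top$ would require keeping $x$ for the backward pass.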

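LoRA-FA is a memory-efficient weight update scheme for fine-tuning large language models. It achieves accuracy close to that of full-parameter fine-tuning while lowering memory usage, improving on LoRA.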

LoRA-FA: Memory-efficient Low-rank Adaptation for Large Language Models Fine-tuning

Parameter-efficient fine-tuning (PEFT) of pre-trained language models (PLMs)
has emerged as a highly successful approach: it trains only a small number
of parameters without sacrificing performance, and it has become the de facto
learning paradigm as PLMs grow in size. However, existing PEFT
methods are not memory-efficient, because they still require caching most of
the intermediate activations for the gradient calculation, much like full fine-tuning.
One effective way to reduce the activation memory is to apply a reversible
model, so that the intermediate activations need not be cached and can
be recomputed instead. Nevertheless, modifying a PLM to its reversible variant with
PEFT is not straightforward, since the reversible model has a distinct
architecture from the currently released PLMs. In this paper, we first
investigate the key factor behind the success of existing PEFT methods, and
find that it is essential to preserve the PLM's starting point when
initializing a PEFT method. With this finding, we propose memory-efficient
fine-tuning (MEFT) that inserts adapters into a PLM, preserving the PLM's
starting point and making it reversible without additional pre-training. We
evaluate MEFT on the GLUE benchmark and five question-answering tasks with
various backbones: BERT, RoBERTa, BART and OPT. MEFT reduces activation memory
by up to 84% relative to full fine-tuning, with a negligible number of
trainable parameters. Moreover, MEFT achieves the same score on GLUE and a
comparable score on the question-answering tasks as full fine-tuning.
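
The property that makes a reversible model memory-efficient can be sketched with the generic additive coupling used in reversible residual networks: the inputs of a block are recomputed exactly from its outputs, so they need not be cached for the backward pass. This is a minimal sketch under that assumption. The sub-layers `f` and `g` are hypothetical stand-ins, and the snippet is not MEFT's specific adapter-based construction.

```python
import math


def f(v):
    """Stand-in sub-layer (assumption); any deterministic function works."""
    return [math.tanh(a) for a in v]


def g(v):
    """Second stand-in sub-layer (assumption)."""
    return [0.5 * a * a for a in v]


def reversible_forward(x1, x2):
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2


def reversible_inverse(y1, y2):
    """Recompute the inputs from the outputs, so they need not be cached."""
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2


x1, x2 = [0.3, -1.2], [0.8, 0.1]
y1, y2 = reversible_forward(x1, x2)
r1, r2 = reversible_inverse(y1, y2)  # matches x1, x2 up to rounding
```

The recomputation trades extra compute for memory, and the coupling changes the layer structure, which is why converting an existing PLM is not straightforward.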

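This paper proposes memory-efficient fine-tuning (MEFT), which inserts adapters into a pre-trained language model so that the PLM's starting point is preserved and the model becomes reversible. MEFT reduces activation memory by up to 84% relative to full fine-tuning and achieves the same score as full fine-tuning on the GLUE benchmark.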