Model parallelism has become a necessity for training modern large-scale deep language models. In this work, we identify a new dimension that is orthogonal to existing model-parallel approaches: thanks to the autoregressive property of Transformer-based language models, it is possible to perform pipeline parallelism within a single training sequence. This enables more fine-grained pipelining than previous work. With this key idea, we design TeraPipe, a high-performance token-level pipeline parallel algorithm for synchronous model-parallel training of Transformer-based language models. We develop a novel dynamic programming-based algorithm to calculate the optimal pipelining execution scheme given a specific model and cluster configuration. We show that TeraPipe can speed up training by 5.0x for the largest GPT-3 model with 175 billion parameters on an AWS cluster with 48 p3.16xlarge instances compared with state-of-the-art model-parallel methods.
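
To make the dynamic programming idea concrete, the following is a minimal sketch of how a slicing of one sequence might be chosen. It is an illustration under stated assumptions, not the implementation described above. The cost function `slice_cost` and its constants are hypothetical. The latency model (sum of slice times plus `num_stages - 1` times the slowest slice) is likewise an assumed simplification. The names `pipeline_latency` and `best_slicing` are introduced here for illustration only.

```python
def slice_cost(length, context):
    """Hypothetical per-stage time for a slice of `length` tokens that
    attends to `context` earlier tokens. The constants are made up: a fixed
    launch overhead, a linear term, and an attention term that grows with
    the amount of preceding context."""
    return 0.5 + 0.01 * length + 0.0001 * length * (context + length)


def pipeline_latency(slices, num_stages, cost=slice_cost):
    """Assumed latency model: every slice passes through each stage in
    order, so the total is the sum of slice times plus (num_stages - 1)
    times the slowest slice."""
    times, context = [], 0
    for length in slices:
        times.append(cost(length, context))
        context += length
    return sum(times) + (num_stages - 1) * max(times)


def best_slicing(seq_len, num_stages, cost=slice_cost):
    """Return (latency, slice_lengths) minimizing the assumed latency model.

    For each candidate bound t_max on the slowest slice, a 1-D dynamic
    program finds the smallest total time that covers the sequence using
    only slices no slower than t_max. The best candidate wins.
    """
    inf = float("inf")
    # t[i, j]: time of the slice covering tokens i..j-1 (context i).
    t = {(i, j): cost(j - i, i)
         for i in range(seq_len) for j in range(i + 1, seq_len + 1)}
    best_latency, best_slices = inf, None
    for t_max in sorted(set(t.values())):
        total = [0.0] + [inf] * seq_len   # total[j]: best sum for first j tokens
        prev = [None] * (seq_len + 1)     # prev[j]: start of the last slice
        for j in range(1, seq_len + 1):
            for i in range(j):
                if t[i, j] <= t_max and total[i] + t[i, j] < total[j]:
                    total[j] = total[i] + t[i, j]
                    prev[j] = i
        if total[seq_len] == inf:
            continue  # no slicing satisfies this bound
        latency = total[seq_len] + (num_stages - 1) * t_max
        if latency < best_latency:
            slices, j = [], seq_len
            while j > 0:  # walk back through the chosen cut points
                slices.append(j - prev[j])
                j = prev[j]
            best_latency, best_slices = latency, slices[::-1]
    return best_latency, best_slices


if __name__ == "__main__":
    latency, slices = best_slicing(seq_len=16, num_stages=4)
    print("slice lengths:", slices)
    print("estimated latency:", latency)
```

This sketch enumerates every distinct slice time as a candidate bound, which is affordable only for short sequences. In practice the per-slice times would come from measurements on the target model and cluster rather than from a closed-form function.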

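In this work, we propose a new dimension of parallelism for efficient training of Transformer-based language models: pipeline parallelism within a single training sequence. We develop TeraPipe, a token-level pipeline parallel algorithm for synchronous model-parallel training, together with a dynamic programming-based algorithm that selects its pipelining execution scheme. We show that, thanks to this finer-grained pipelining, TeraPipe speeds up training of the largest GPT-3 model with 175 billion parameters by 5.0x on an AWS cluster with 48 p3.16xlarge instances, compared with state-of-the-art model-parallel methods.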

TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models