It is a challenging task to train large DNN models on sophisticated GPU
platforms with diversified interconnect capabilities. Recently, pipelined
training has been proposed as an effective approach for improving device
utilization. However, there are still several tricky issues to address:
improving computing efficiency while ensuring convergence, and reducing memory
usage without incurring additional computing costs. We propose DAPPLE, a
synchronous training framework which combines data parallelism and pipeline
parallelism for large DNN models. It features a novel parallelization strategy
planner to solve the partition and placement problems, and explores the optimal
hybrid strategy of data and pipeline parallelism. We also propose a new runtime
scheduling algorithm to reduce device memory usage, which is orthogonal to
the re-computation approach and does not come at the expense of training
throughput. Experiments show that the DAPPLE planner consistently outperforms
strategies generated by PipeDream's planner by up to 3.23x under synchronous
training scenarios, and that the DAPPLE runtime achieves a 1.6x speedup in
training throughput over GPipe while reducing memory consumption by 12%.

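Summary: The paper proposes DAPPLE, a synchronous training framework that combines data parallelism and pipeline parallelism. A novel parallelization strategy planner solves the partition and placement problems and explores the optimal hybrid strategy of data and pipeline parallelism. Compared with GPipe, the DAPPLE runtime achieves a 1.6x speedup in training throughput and reduces memory consumption by 12%.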

DAPPLE: A Pipelined Data Parallel Approach for Training Large Models
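
The DAPPLE abstract claims that a runtime scheduling change can lower device memory without re-computation, but it does not spell out the algorithm. As an illustration only, the sketch below assumes an "early backward" order (often called 1F1B) and compares it with an all-forwards-first order of the kind associated with GPipe. It counts how many micro-batch activations each pipeline stage holds at its peak. The function names and the choice of 4 stages and 8 micro-batches are hypothetical, and the model looks only at per-stage operation order, not at timing or cross-stage dependencies, so it does not reproduce the reported 12% figure.

```python
from typing import Callable, List

Schedule = Callable[[int, int, int], List[str]]


def all_forward_first_ops(stage: int, num_stages: int, num_micro: int) -> List[str]:
    """Every forward pass ("F") runs before any backward pass ("B")."""
    return ["F"] * num_micro + ["B"] * num_micro


def early_backward_ops(stage: int, num_stages: int, num_micro: int) -> List[str]:
    """Assumed 1F1B-style order: a short warm-up of forwards, then
    alternating forward/backward, then the remaining backwards."""
    warmup = min(num_stages - stage - 1, num_micro)
    steady = num_micro - warmup
    return ["F"] * warmup + ["F", "B"] * steady + ["B"] * warmup


def peak_live_activations(ops: List[str]) -> int:
    """A forward stores one micro-batch's activations; a backward frees one."""
    live = peak = 0
    for op in ops:
        live += 1 if op == "F" else -1
        peak = max(peak, live)
    return peak


def peak_per_stage(schedule: Schedule, num_stages: int, num_micro: int) -> List[int]:
    """Peak number of stored micro-batch activations for each stage."""
    return [
        peak_live_activations(schedule(stage, num_stages, num_micro))
        for stage in range(num_stages)
    ]


if __name__ == "__main__":
    stages, micro_batches = 4, 8  # hypothetical configuration
    print("all-forward-first:", peak_per_stage(all_forward_first_ops, stages, micro_batches))
    print("early-backward:   ", peak_per_stage(early_backward_ops, stages, micro_batches))
```

In this toy model the all-forwards-first order keeps every micro-batch alive on every stage, while the early-backward order caps the count at the number of downstream stages plus one. Both orders execute the same forward and backward passes, which is consistent with the abstract's statement that the memory saving need not cost throughput.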

Distributed training of deep nets is an important technique to address some
of the present-day computing challenges like memory consumption and
computational demands. Classical distributed approaches, synchronous or
asynchronous, are based on the parameter server architecture, i.e., worker
nodes compute gradients which are communicated to the parameter server while
updated parameters are returned. Recently, distributed training with AllReduce
operations gained popularity as well. While many of those operations seem
appealing, little is reported about wall-clock training time improvements. In
this paper, we carefully analyze the AllReduce-based setup, propose timing
models which include network latency, bandwidth, cluster size and compute time,
and demonstrate that pipelined training with a width of two combines the best
of both synchronous and asynchronous training. Specifically, for a setup
consisting of a four-node GPU cluster, we show wall-clock training-time
improvements of up to 5.4x compared to conventional approaches.

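Summary: The paper proposes an AllReduce-based distributed deep learning training method. Tests on a four-node GPU cluster show that a pipelined architecture with a width of two combines the advantages of synchronous and asynchronous training, speeding up wall-clock training by up to 5.4x.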
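
The abstract above mentions timing models built from network latency, bandwidth, cluster size and compute time, without giving their form. The sketch below is one plausible simplification and not the paper's model: it uses the textbook ring-AllReduce cost (2(p-1) steps, each moving 1/p of the data) and assumes that a pipeline of width two lets the communication of one iteration overlap fully with the compute of the next. All function names and the numbers in the example are assumptions chosen for illustration.

```python
def ring_allreduce_time(
    num_bytes: float,
    num_nodes: int,
    latency_s: float,
    bandwidth_bytes_per_s: float,
) -> float:
    """Textbook ring-AllReduce estimate: 2(p-1) steps, each paying one
    latency plus the transfer time of a 1/p-sized chunk."""
    if num_nodes < 2:
        return 0.0
    steps = 2 * (num_nodes - 1)
    chunk_bytes = num_bytes / num_nodes
    return steps * (latency_s + chunk_bytes / bandwidth_bytes_per_s)


def synchronous_iteration_time(compute_s: float, comm_s: float) -> float:
    """Compute and communication run back to back."""
    return compute_s + comm_s


def pipelined_iteration_time(compute_s: float, comm_s: float) -> float:
    """Assumed width-two pipeline: communication of iteration t overlaps
    with compute of iteration t+1, so the slower of the two dominates."""
    return max(compute_s, comm_s)


if __name__ == "__main__":
    # Hypothetical setup: 100 MB of gradients, 4 nodes, 50 us latency, 10 Gb/s links.
    comm = ring_allreduce_time(1e8, 4, 50e-6, 1.25e9)
    compute = 0.1  # seconds per iteration, assumed
    sync = synchronous_iteration_time(compute, comm)
    pipe = pipelined_iteration_time(compute, comm)
    print(f"communication: {comm:.4f} s")
    print(f"synchronous:   {sync:.4f} s per iteration")
    print(f"pipelined:     {pipe:.4f} s per iteration")
    print(f"speedup:       {sync / pipe:.2f}x")
```

Under these assumptions overlap alone can at most halve the iteration time, so this sketch is not meant to account for the reported 5.4x improvement; it only shows how latency, bandwidth, cluster size and compute time might enter such a model.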