We introduce Breadth-First Pipeline Parallelism, a novel training schedule
that optimizes the combination of pipeline and data parallelism. Breadth-First
Pipeline Parallelism lowers training time, cost and memory usage by combining
high GPU utilization with a small batch size per GPU, and by making use of
fully sharded data parallelism. Experimentally, we observed increases of up to
53% in training speed.

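This work introduces Breadth-First Pipeline Parallelism, a new training schedule that combines pipeline and data parallelism. By pairing high GPU utilization with a small per-GPU batch size and using fully sharded data parallelism, it lowers training time, cost and memory usage. Experiments show training speed increases of up to 53%.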

Breadth-First Pipeline Parallelism

Deep learning algorithms can fare poorly when the training dataset suffers
from heavy class-imbalance but the testing criterion requires good
generalization on less frequent classes. We design two novel methods to improve
performance in such scenarios. First, we propose a theoretically-principled
label-distribution-aware margin (LDAM) loss motivated by minimizing a
margin-based generalization bound. This loss replaces the standard
cross-entropy objective during training and can be applied with prior
strategies for training with class-imbalance such as re-weighting or
re-sampling. Second, we propose a simple, yet effective, training schedule that
defers re-weighting until after the initial stage, allowing the model to learn
an initial representation while avoiding some of the complications associated
with re-weighting or re-sampling. We test our methods on several benchmark
vision tasks including the real-world imbalanced dataset iNaturalist 2018. Our
experiments show that either of these methods alone can already improve over
existing techniques and their combination achieves even better performance
gains.

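To address the poor performance of deep learning models trained on class-imbalanced data, this work proposes two new methods. The first is a theoretically principled label-distribution-aware margin (LDAM) loss. The second is a simple yet effective training schedule that defers re-weighting until after the initial stage, so the model can learn an initial representation while avoiding some of the complications of re-weighting or re-sampling. Experiments show that either method alone improves over existing techniques, and that combining them improves performance further.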
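
The two ideas above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The margin form, proportional to n_j^(-1/4) for a class with n_j training examples, follows the LDAM paper; the rescaling constant `max_margin=0.5`, the inverse-frequency weights, and the `defer_until` epoch are illustrative assumptions, and all function names are hypothetical.

```python
import math


def ldam_margins(class_counts, max_margin=0.5):
    """Per-class margins proportional to n_j^(-1/4).

    Rescaled so the rarest class gets max_margin (an assumed constant).
    """
    raw = [n ** -0.25 for n in class_counts]
    scale = max_margin / max(raw)
    return [scale * r for r in raw]


def ldam_loss(logits, label, margins, weights=None):
    """Cross-entropy after subtracting the true class's margin from its logit."""
    shifted = list(logits)
    shifted[label] -= margins[label]
    # Numerically stable log-sum-exp over the shifted logits.
    m = max(shifted)
    log_sum = m + math.log(sum(math.exp(z - m) for z in shifted))
    loss = log_sum - shifted[label]
    if weights is not None:
        loss *= weights[label]
    return loss


def deferred_weights(class_counts, epoch, defer_until):
    """Uniform class weights during the initial stage, re-weighting afterwards.

    The inverse-frequency weighting (normalised to mean 1) is an assumption
    chosen for simplicity; other re-weighting schemes fit the same schedule.
    """
    k = len(class_counts)
    if epoch < defer_until:
        return [1.0] * k
    inv = [1.0 / n for n in class_counts]
    total = sum(inv)
    return [k * w / total for w in inv]


if __name__ == "__main__":
    counts = [1000, 10]           # a frequent class and a rare class
    margins = ldam_margins(counts)
    for epoch in (0, 5):
        w = deferred_weights(counts, epoch, defer_until=5)
        print(epoch, ldam_loss([2.0, 1.0], 1, margins, w))
```

Rare classes receive larger margins, so the loss on a rare-class example exceeds plain cross-entropy on the same logits, and the per-class weights only take effect once `epoch` reaches `defer_until`.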