Parallelism is crucial for accelerating the training of deep neural networks. Pipeline parallelism can provide an efficient alternative to traditional data parallelism by allowing workers to specialize. Performing mini-batch SGD using pipeline parallelism has the overhead of filling and draining the pipeline. Pipelined Backpropagation updates the model parameters without draining the pipeline. This removes the overhead but introduces stale gradients and inconsistency between the weights used on the forward and backward passes, reducing final accuracy and the stability of training. We introduce Spike Compensation and Linear Weight Prediction to mitigate these effects. Analysis on a convex quadratic shows that both methods effectively counteract staleness. We train multiple convolutional networks at a batch size of one, completely replacing batch parallelism with fine-grained pipeline parallelism. With our methods, Pipelined Backpropagation achieves full accuracy on CIFAR-10 and ImageNet without hyperparameter tuning.
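To illustrate why stale gradients can hurt stability, here is a minimal sketch of plain gradient descent on a one-dimensional convex quadratic where each gradient is computed at weights from several steps earlier. It is not the paper's analysis and does not implement Spike Compensation or Linear Weight Prediction. The function name `delayed_gradient_descent` and all parameter values are assumptions chosen for illustration.

```python
from collections import deque


def delayed_gradient_descent(curvature, lr, delay, steps, w0=1.0):
    """Minimize f(w) = 0.5 * curvature * w**2 using gradients that are `delay` steps stale.

    Hypothetical toy model: `delay` stands in for the number of pipeline
    steps between a forward pass and the corresponding weight update.
    """
    # Keep the last `delay + 1` weight values; the oldest one is the stale copy.
    history = deque([w0] * (delay + 1), maxlen=delay + 1)
    w = w0
    trajectory = [w]
    for _ in range(steps):
        stale_w = history[0]          # weights from `delay` steps ago
        grad = curvature * stale_w    # gradient of the quadratic at the stale weights
        w = w - lr * grad             # update the current weights
        history.append(w)             # the oldest entry is dropped automatically
        trajectory.append(w)
    return trajectory


if __name__ == "__main__":
    fresh = delayed_gradient_descent(curvature=1.0, lr=0.5, delay=0, steps=200)
    stale = delayed_gradient_descent(curvature=1.0, lr=0.5, delay=4, steps=200)
    print("no delay, final |w|:", abs(fresh[-1]))
    print("delay of 4, largest |w| over last 10 steps:", max(abs(w) for w in stale[-10:]))
```

In this toy setting, the same learning rate that converges without delay makes the iterates oscillate and grow when the gradient is four steps stale. This is the kind of instability that staleness-mitigation methods aim to counteract.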

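This paper studies hardware accelerators for deep neural networks and proposes an asynchronous pipeline-parallel training algorithm with hardware advantages. By introducing two methods, Spike Compensation and Linear Weight Prediction, it effectively mitigates the drawbacks caused by the asynchrony of Pipelined Backpropagation and outperforms existing techniques. Appropriate normalization and small batch sizes also aid training. Compared with SGD, the approach matches accuracy when training multiple networks on CIFAR-10 and ImageNet.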

Pipelined Backpropagation at Scale: Training Large Models without Batches