To reduce the long training time of large deep neural network (DNN) models, distributed synchronous stochastic gradient descent (S-SGD) is commonly used on a cluster of workers. However, the speedup brought by multiple workers is limited by the communication overhead. Two approaches, namely pipelining and gradient sparsification, have been separately proposed to alleviate the impact of communication overheads. Yet, the gradient sparsification methods can only initiate the communication after the backpropagation, and hence miss the pipelining opportunity. In this paper, we propose a new distributed optimization method named LAGS-SGD, which combines S-SGD with a novel layer-wise adaptive gradient sparsification (LAGS) scheme. In LAGS-SGD, every worker selects a small set of "significant" gradients from each layer independently whose size can be adaptive to the communication-to-computation ratio of that layer. The layer-wise nature of LAGS-SGD opens the opportunity of overlapping communications with computations, while the adaptive nature of LAGS-SGD makes it flexible to control the communication time. We prove that LAGS-SGD has convergence guarantees and it has the same order of convergence rate as vanilla S-SGD under a weak analytical assumption. Extensive experiments are conducted to verify the analytical assumption and the convergence performance of LAGS-SGD. Experimental results show that LAGS-SGD achieves from around 40\% to 95\% of the maximum benefit of pipelining on a 16-node GPU cluster. Combining the benefit of pipelining and sparsification, the speedup of LAGS-SGD over S-SGD ranges from 2.86$\times$ to 8.52$\times$ on our tested CNN and LSTM models, without losing obvious model accuracy.

本文提出了一种新的分布式优化方法LAGS-SGD，它结合了S-SGD与一种新的LAGS方案，通过采用分层自适应梯度稀疏来减少通信负担，实现了通信和计算之间的重叠，同时保证了收敛性能。在16-GPU群集上的实验结果表明，LAGS-SGD在不失精度的情况下优于原始的S-SGD和现有的稀疏S-SGD。

分布式深度学习中的层自适应梯度稀疏化及收敛性保证