Recent work shows that decentralized parallel stochastic gradient decent (D-PSGD) can outperform its centralized counterpart both theoretically and practically. While asynchronous parallelism is a powerful technology to improve the efficiency of parallelism in distributed machine learning platforms and has been widely used in many popular machine learning softwares and solvers based on centralized parallel protocols such as Tensorflow, it still remains unclear how to apply the asynchronous parallelism to improve the efficiency of decentralized parallel algorithms. This paper proposes an asynchronous decentralize parallel stochastic gradient descent algorithm to apply the asynchronous parallelism technology to decentralized algorithms. Our theoretical analysis provides the convergence rate or equivalently the computational complexity, which is consistent with many special cases and indicates we can achieve nice linear speedup when we increase the number of nodes or the batchsize. Extensive experiments in deep learning validate the proposed algorithm.

本文提出了一种异步的分布式随机梯度下降算法（AD-PSGD）来解决异构环境下常用的同步算法（如AllReduce-SGD）和参数服务器 suffer from 的问题，并且在理论分析和经验结果上证明了 AD-PSGD 在异构环境下具有良好的收敛速度和通信效率优势。

异步分散并行随机梯度下降