Distributed stochastic gradient descent (SGD) has attracted considerable recent attention due to its potential for scaling computational resources, reducing training time, and helping protect user privacy in machine learning. However, the staggers and limited bandwidth may induce random computational/communication delays, thereby severely hindering the learning process. Therefore, how to accelerate asynchronous SGD by efficiently scheduling multiple workers is an important issue. In this paper, a unified framework is presented to analyze and optimize the convergence of asynchronous SGD based on stochastic delay differential equations (SDDEs) and the Poisson approximation of aggregated gradient arrivals. In particular, we present the run time and staleness of distributed SGD without a memorylessness assumption on the computation times. Given the learning rate, we reveal the relevant SDDE's damping coefficient and its delay statistics, as functions of the number of activated clients, staleness threshold, the eigenvalues of the Hessian matrix of the objective function, and the overall computational/communication delay. The formulated SDDE allows us to present both the distributed SGD's convergence condition and speed by calculating its characteristic roots, thereby optimizing the scheduling policies for asynchronous/event-triggered SGD. It is interestingly shown that increasing the number of activated workers does not necessarily accelerate distributed SGD due to staleness. Moreover, a small degree of staleness does not necessarily slow down the convergence, while a large degree of staleness will result in the divergence of distributed SGD. Numerical results demonstrate the potential of our SDDE framework, even in complex learning tasks with non-convex objective functions.

基于随机时滞微分方程（SDDE）和概率到达梯度泊松近似，提出了统一框架来分析和优化异步分布式随机梯度下降（ASGD）的收敛性，揭示了ASGD的阻尼系数、延迟统计特性和收敛条件，以及调度策略的优化。同时表明增加激活的工作人员数量并不一定加速分布式随机梯度下降，而较大程度的时滞将导致其发散。通过数值实验验证了SDDE框架在复杂学习任务中具有的潜力。

带有延迟的分布式随机梯度下降：基于随机延迟微分方程的框架