We analyze (stochastic) gradient descent (SGD) with delayed updates on smooth quasi-convex and non-convex functions and derive concise, non-asymptotic convergence rates. We show that the rate of convergence in all cases consists of two terms: (i) a stochastic term which is not affected by the delay, and (ii) a higher-order deterministic term which is only linearly slowed down by the delay. Thus, in the presence of noise, the effects of the delay become negligible after a few iterations and the algorithm converges at the same optimal rate as standard SGD. This result extends a line of research that showed similar results in the asymptotic regime or for strongly convex quadratic functions only. We further show similar results for SGD with more intricate forms of delayed gradients: compressed gradients under error compensation, and local SGD, where multiple workers perform local steps before communicating with each other. In all of these settings, we improve upon the best known rates. These results show that SGD is robust to compressed and/or delayed stochastic gradient updates. This is particularly important for distributed parallel implementations, where asynchronous and communication-efficient methods are key to achieving linear speedups for optimization with multiple devices.
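
To make the two update schemes named above concrete, here is a minimal illustrative sketch, not the paper's algorithm listing or experimental code. It assumes a toy quadratic objective f(x) = 0.5 * sum(c_i * x_i^2), Gaussian gradient noise, and top-k sparsification as the compressor; the function names (`delayed_sgd`, `ef_sgd`, `top_k`) and all parameter values are hypothetical choices made for illustration.

```python
import random


def loss(x, curvature):
    """Toy objective f(x) = 0.5 * sum(c_i * x_i**2) (an assumption for this sketch)."""
    return 0.5 * sum(c * xi * xi for c, xi in zip(curvature, x))


def grad(x, curvature, rng, noise_std):
    """Stochastic gradient of the toy objective: exact gradient plus Gaussian noise."""
    return [c * xi + rng.gauss(0.0, noise_std) for c, xi in zip(curvature, x)]


def delayed_sgd(x0, curvature, lr, delay, steps, noise_std=0.0, seed=0):
    """SGD where the gradient applied at step t was computed at iterate x_{t - delay}."""
    rng = random.Random(seed)
    x = list(x0)
    history = [list(x0)]  # history[t] is the iterate x_t
    for t in range(steps):
        stale = history[max(0, t - delay)]  # fall back to x_0 during the first steps
        g = grad(stale, curvature, rng, noise_std)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
        history.append(list(x))
    return x


def top_k(v, k):
    """Keep the k largest-magnitude entries of v and zero out the rest."""
    order = sorted(range(len(v)), key=lambda i: abs(v[i]), reverse=True)
    keep = set(order[:k])
    return [vi if i in keep else 0.0 for i, vi in enumerate(v)]


def ef_sgd(x0, curvature, lr, k, steps, noise_std=0.0, seed=0):
    """SGD with compressed updates and error compensation (error feedback)."""
    rng = random.Random(seed)
    x = list(x0)
    e = [0.0] * len(x)  # memory of everything the compressor has dropped so far
    for _ in range(steps):
        g = grad(x, curvature, rng, noise_std)
        p = [ei + lr * gi for ei, gi in zip(e, g)]  # add back previously dropped mass
        delta = top_k(p, k)                         # only this part is applied ("sent")
        x = [xi - di for xi, di in zip(x, delta)]
        e = [pi - di for pi, di in zip(p, delta)]   # remember what was left out
    return x


if __name__ == "__main__":
    curvature = [1.0, 0.5, 0.1]
    x0 = [1.0, 1.0, 1.0]
    print("initial loss:", loss(x0, curvature))
    print("delayed SGD :", loss(delayed_sgd(x0, curvature, 0.01, 5, 3000, 0.01), curvature))
    print("EF-SGD top-1:", loss(ef_sgd(x0, curvature, 0.01, 1, 3000, 0.01), curvature))
```

In the error-compensated variant, the dropped part of each update is stored and re-added later, so it reaches the iterate with a delay rather than being lost. This is the sense in which compressed gradients can be read as a form of delayed gradients.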

The Error-Feedback Framework: Better Rates for SGD with Delayed Gradients and Compressed Communication