Error accumulation is an essential component of the Top-$k$ sparsification method in distributed gradient descent. It implicitly scales the learning rate and prevents the slow-down of lateral movement, but it can also deteriorate convergence. This paper proposes a novel sparsification algorithm, regularized Top-$k$ (RegTop-$k$), that controls the learning rate scaling caused by error accumulation. The algorithm is developed by formulating gradient sparsification as an inference problem and determining a Bayesian optimal sparsification mask via maximum-a-posteriori estimation. It uses past aggregated gradients to evaluate posterior statistics, and prioritizes the local gradient entries based on these statistics. Numerical experiments with ResNet-18 on CIFAR-10 show that at $0.1\%$ sparsification, RegTop-$k$ achieves about $8\%$ higher accuracy than standard Top-$k$.
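
For orientation, the baseline that the abstract refers to, standard Top-$k$ sparsification with error accumulation, can be sketched in a few lines of NumPy. This is only an illustrative sketch of the commonly used baseline scheme, not the RegTop-$k$ algorithm itself; the function names `top_k_mask` and `sparsify_with_error_accumulation` and the toy gradient values are assumptions introduced here and do not come from the paper.

```python
import numpy as np


def top_k_mask(vec, k):
    """Return a 0/1 mask selecting the k largest-magnitude entries of vec.

    Hypothetical helper; assumes 1 <= k <= len(vec).
    """
    mask = np.zeros_like(vec, dtype=float)
    # argpartition places the indices of the k largest |entries| in the last k slots
    idx = np.argpartition(np.abs(vec), -k)[-k:]
    mask[idx] = 1.0
    return mask


def sparsify_with_error_accumulation(grad, error, k):
    """One worker-side step of Top-k sparsification with error accumulation.

    grad:  local gradient at this iteration
    error: entries left unsent in earlier iterations
    Returns the sparse vector to transmit and the updated error.
    """
    accumulated = grad + error          # add back what was dropped before
    mask = top_k_mask(accumulated, k)   # keep only the k largest magnitudes
    sent = accumulated * mask           # sparse message sent to the server
    new_error = accumulated - sent      # unsent entries carry over
    return sent, new_error


# Toy usage with k = 1 on a 4-dimensional gradient (values are made up)
error = np.zeros(4)
sent1, error = sparsify_with_error_accumulation(
    np.array([0.1, -2.0, 0.3, 1.5]), error, k=1
)
sent2, error = sparsify_with_error_accumulation(
    np.array([0.1, 0.1, 0.1, 0.1]), error, k=1
)
```

In the second step the fourth entry is selected because its accumulated value (the new gradient plus the carried-over error) dominates, even though the fresh gradient entries are all equal. This carry-over is the mechanism that the abstract describes as implicitly scaling the learning rate.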

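This work addresses the error accumulation problem of the Top-$k$ sparsification method in distributed gradient descent, which can impair convergence. The proposed regularized Top-$k$ (RegTop-$k$) algorithm determines a Bayesian optimal sparsification mask via maximum-a-posteriori estimation and thereby controls the learning rate scaling. Experimental results show that at a $0.1\%$ sparsification rate, RegTop-$k$ achieves about $8\%$ higher accuracy than standard Top-$k$ with ResNet-18 on CIFAR-10, indicating considerable potential for improvement.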

A novel gradient sparsification algorithm based on Bayesian inference