Decentralized stochastic gradient descent (D-SGD) enables collaborative learning across a massive number of devices simultaneously, without the control of a central server. However, existing theories claim that decentralization invariably undermines generalization. In this paper, we challenge this conventional belief and present a new perspective for understanding decentralized learning. We prove that D-SGD implicitly minimizes the loss function of an average-direction sharpness-aware minimization (SAM) algorithm under general non-convex, non-$\beta$-smooth settings. This surprising asymptotic equivalence reveals an intrinsic regularization-optimization trade-off and three advantages of decentralization: (1) D-SGD contains a free uncertainty evaluation mechanism that improves posterior estimation; (2) D-SGD exhibits a gradient smoothing effect; and (3) the sharpness regularization effect of D-SGD does not decrease as the total batch size increases, which justifies the potential generalization benefit of D-SGD over centralized SGD (C-SGD) in large-batch scenarios.

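This paper challenges the conventional belief and offers a new perspective for understanding decentralized learning. It proves that decentralized stochastic gradient descent implicitly minimizes the loss function of an average-direction sharpness-aware minimization algorithm under general non-convex, non-$\beta$-smooth settings. This surprising asymptotic equivalence reveals an intrinsic regularization-optimization trade-off and three advantages of decentralization.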

Decentralized SGD and average-direction SAM are asymptotically equivalent.
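
For readers unfamiliar with the algorithm, the following is a minimal sketch of a generic D-SGD loop, not the paper's implementation. It assumes a ring communication topology, a least-squares objective, and every node sampling from the same dataset. All of these are illustrative choices, and the names `ring_gossip_matrix` and `dsgd` are hypothetical. Each node averages its neighbours' models through a doubly stochastic mixing matrix and then takes a local stochastic gradient step. The asymptotic equivalence with average-direction SAM is not reproduced here.

```python
import numpy as np


def ring_gossip_matrix(m):
    """Doubly stochastic mixing matrix for a ring of m nodes (assumed topology).

    Each node gives weight 1/3 to itself and to each of its two ring neighbours.
    Using += keeps rows summing to 1 even for m < 3, where neighbours coincide.
    """
    P = np.zeros((m, m))
    for j in range(m):
        P[j, j] += 1.0 / 3.0
        P[j, (j - 1) % m] += 1.0 / 3.0
        P[j, (j + 1) % m] += 1.0 / 3.0
    return P


def dsgd(P, X, y, lr=0.05, steps=200, batch=8, seed=0):
    """Run D-SGD on a toy least-squares loss and return all local models.

    P     : (m, m) doubly stochastic mixing matrix
    X, y  : data shared by all nodes (an illustrative simplification)
    Returns W of shape (m, d), one row per node.
    """
    rng = np.random.default_rng(seed)
    m, d = P.shape[0], X.shape[1]
    W = rng.normal(size=(m, d))  # independent random initial model per node
    for _ in range(steps):
        grads = np.empty_like(W)
        for j in range(m):
            # Each node draws its own mini-batch and evaluates the gradient
            # at its own local model, with no central server involved.
            idx = rng.choice(len(X), size=batch, replace=False)
            resid = X[idx] @ W[j] - y[idx]
            grads[j] = X[idx].T @ resid / batch
        # Gossip averaging with neighbours, followed by the local SGD step.
        W = P @ W - lr * grads
    return W
```

Averaging the rows of the returned matrix (`W.mean(axis=0)`) gives the global averaged model, and the spread of the rows around that average measures how far the nodes are from consensus.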