Distributed machine learning has recently become a critical paradigm for training large models on vast datasets. We examine the stochastic optimization problem for deep learning in synchronous parallel computing environments under communication constraints. Although averaging the distributed gradients is the most widely used method for gradient estimation, whether it is the optimal strategy remains an open question. In this work, we analyze the distributed gradient aggregation process through the lens of subspace optimization. By formulating aggregation as an objective-aware subspace optimization problem, we derive an efficient weighting scheme for the gradients, guided by subspace coefficients. We further introduce subspace momentum to accelerate convergence while keeping the aggregation statistically unbiased. Our method improves on the ubiquitous gradient averaging on multiple MLPerf tasks while remaining extremely efficient in both communication and computational complexity.
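
To make the contrast between plain averaging and a weighted aggregation concrete, the following is a minimal sketch, not the scheme derived in this work. It assumes a hypothetical weighting in which each worker's coefficient is proportional to the inner product of its gradient with the mean gradient, normalized so that the coefficients sum to one. The function names (`average`, `consensus_weights`, `aggregate`) are illustrative only.

```python
def dot(u, v):
    """Inner product of two equal-length vectors given as lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def average(grads):
    """Baseline: uniform averaging of the per-worker gradients."""
    n = len(grads)
    return [sum(column) / n for column in zip(*grads)]


def consensus_weights(grads):
    """Hypothetical coefficients (an assumption for illustration):
    alpha_i is proportional to <g_i, mean gradient>, normalized to sum to 1.
    Falls back to uniform weights when the mean gradient is zero."""
    g_bar = average(grads)
    scores = [dot(g, g_bar) for g in grads]
    total = sum(scores)  # equals N * ||g_bar||^2, so it is never negative
    if total == 0:
        return [1.0 / len(grads)] * len(grads)
    return [s / total for s in scores]


def aggregate(grads, weights):
    """Weighted combination sum_i alpha_i * g_i, a point in the span of the gradients."""
    dim = len(grads[0])
    return [sum(w * g[k] for w, g in zip(weights, grads)) for k in range(dim)]


# Three workers; two agree on a direction, one disagrees.
grads = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
weights = consensus_weights(grads)
weighted = aggregate(grads, weights)
uniform = average(grads)
```

In this toy setting the two agreeing workers receive larger coefficients than the outlier, so the weighted direction leans further toward the consensus than the uniform average does. When all workers agree equally with the mean, the coefficients reduce to uniform averaging.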

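This work addresses the efficiency of gradient aggregation in distributed deep learning under limited communication. By treating aggregation as an objective-aware subspace optimization problem, it proposes a new weighting scheme and introduces subspace momentum to accelerate convergence while keeping the aggregation statistically unbiased. Experimental results show that the method outperforms conventional gradient averaging on multiple machine learning tasks while being more efficient.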

Adaptive Consensus Gradient Aggregation for Large-Scale Distributed Training