Sparsely-activated Mixture-of-Experts (MoE) layers are used in practice to enlarge large-scale foundation models with only a sub-linear increase in computation. Hybrid parallel paradigms that combine model parallelism (MP), expert parallelism (EP), and expert-sharding parallelism (ESP), i.e., MP+EP+ESP, are widely adopted to train MoE models on GPU clusters, but the communication costs they introduce hinder training efficiency. To address this limitation, we propose Parm, a system that accelerates MP+EP+ESP training with two dedicated schedules for placing communication tasks. The proposed schedules eliminate redundant computations and communications and overlap intra-node with inter-node communications, reducing the overall training time. As the two schedules are not mutually exclusive, we provide comprehensive theoretical analyses and derive an automatic and accurate solution for determining which schedule to apply in a given scenario. Experimental results on an 8-GPU server and a 32-GPU cluster show that Parm outperforms the state-of-the-art MoE training system, DeepSpeed-MoE, achieving 1.13$\times$ to 5.77$\times$ speedup on 1296 manually configured MoE layers and approximately 3$\times$ improvement on two real-world MoE models based on BERT and GPT-2.

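Parm is a system that accelerates MP+EP+ESP training. It uses two dedicated schedules to eliminate redundant computation and communication tasks and to overlap intra-node with inter-node communications, reducing the overall training time. Experimental results on an 8-GPU server and a 32-GPU cluster show that Parm outperforms the state-of-the-art MoE training system, DeepSpeed-MoE, achieving 1.13$\times$ to 5.77$\times$ speedup on 1296 manually configured MoE layers and approximately 3$\times$ improvement on two real-world MoE models based on BERT and GPT-2.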

Parm: Efficient Training of Large Sparsely-Activated Models with Dedicated Schedules