Sparsely gated Mixture-of-Experts (MoE) has demonstrated its effectiveness in
scaling deep neural networks to extreme sizes. Although numerous efforts have
been made to improve MoE performance from the model-design or
system-optimization perspective, existing MoE dispatch patterns still cannot
fully exploit the underlying heterogeneous network environment. In
this paper, we propose TA-MoE, a topology-aware routing strategy for
large-scale MoE training, from a model-system co-design perspective, which can
dynamically adjust the MoE dispatch pattern according to the network topology.
Based on communication modeling, we abstract the dispatch problem into an
optimization objective and obtain the approximate dispatch pattern under
different topologies. On top of that, we design a topology-aware auxiliary
loss, which can adaptively route the data to fit in the underlying topology
without sacrificing the model accuracy. Experiments show that TA-MoE can
substantially outperform its counterparts on various hardware and model
configurations, with roughly 1.01x-1.61x, 1.01x-4.77x, and 1.25x-1.54x
improvements over the popular DeepSpeed-MoE, FastMoE, and FasterMoE,
respectively.
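
To make the idea of a topology-aware auxiliary loss concrete, the following is a
minimal sketch and not the exact formulation used by TA-MoE. It assumes top-1
gating and takes the widely used load-balancing form (fraction of tokens
dispatched to each expert times the mean gate probability for that expert),
then divides each term by a per-expert target share. The function name
topology_aware_aux_loss and the target_fraction argument are hypothetical; in
practice the targets would come from the communication model, with larger
shares for experts reachable over faster links.

```python
import math


def softmax(row):
    """Numerically stable softmax over one token's gate logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def topology_aware_aux_loss(gate_logits, target_fraction):
    """Illustrative auxiliary loss (an assumption, not the paper's formula).

    gate_logits:     list of per-token logit lists, shape [tokens][experts]
    target_fraction: hypothetical per-expert target share of tokens (sums to 1)
    """
    num_tokens = len(gate_logits)
    num_experts = len(target_fraction)
    probs = [softmax(row) for row in gate_logits]

    # Top-1 dispatch: count how many tokens each expert receives.
    counts = [0] * num_experts
    for p in probs:
        counts[p.index(max(p))] += 1
    dispatch_fraction = [c / num_tokens for c in counts]

    # Mean gate probability per expert (the differentiable part in a real model).
    mean_prob = [
        sum(p[e] for p in probs) / num_tokens for e in range(num_experts)
    ]

    # Dividing by the target share penalizes traffic to experts that the
    # topology says should receive less of it.
    return sum(
        f * m / t
        for f, m, t in zip(dispatch_fraction, mean_prob, target_fraction)
    )
```

With a uniform target (1/E for each of E experts) this reduces to the standard
load-balancing loss, E times the sum of dispatch fraction times mean gate
probability, so the sketch can be read as a topology-weighted generalization
of it.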

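This paper proposes a sparsely gated Mixture-of-Experts deep neural network model based on a topology-aware routing strategy. It dynamically adjusts the dispatch pattern according to the network topology and adapts to that topology through an auxiliary guidance loss. Experiments show that the model outperforms its competitors across various hardware and model configurations, with improvements of 1.01x-1.61x, 1.01x-4.77x, and 1.25x-1.54x.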