Mixture-of-Experts (MoE) language models can reduce computational costs by
2-4$\times$ compared to dense models without sacrificing performance, making
them more efficient in computation-bounded scenarios. However, MoE models
generally require 2-4$\times$ more parameters to achieve comparable
performance to a dense model, which incurs larger GPU memory requirements and
makes MoE models less efficient in I/O-bounded scenarios like autoregressive
generation. In this work, we propose a hybrid dense training and sparse
inference framework for MoE models (DS-MoE) which achieves strong computation
and parameter efficiency by employing dense computation across all experts
during training and sparse computation during inference. Our experiments on
training LLMs demonstrate that our DS-MoE models are more parameter-efficient
than standard sparse MoEs and are on par with dense models in terms of total
parameter size and performance while being computationally cheaper (activating
30-40% of the model's parameters). Performance tests using vLLM show that our
DS-MoE-6B model runs up to $1.86\times$ faster than similar dense models like
Mistral-7B, and between $1.50\times$ and $1.71\times$ faster than comparable
MoEs, such as DeepSeekMoE-16B and Qwen1.5-MoE-A2.7B.
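
The dense-training, sparse-inference idea in this abstract can be illustrated with a toy layer. This is only a minimal sketch under stated assumptions, not the DS-MoE implementation: the class `ToyMoELayer`, the linear experts, the softmax router, and the choice to leave the kept gates unnormalized after top-k selection are all hypothetical and are not taken from the paper. With `top_k=None` every expert contributes to every token (the dense path); with `top_k=k` only the `k` highest-scoring experts per token are evaluated (the sparse path).

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class ToyMoELayer:
    """Hypothetical toy MoE layer with linear experts (illustration only)."""

    def __init__(self, d_model, n_experts, seed=0):
        rng = np.random.default_rng(seed)
        self.n_experts = n_experts
        # Router maps each token to one score per expert.
        self.router = rng.normal(size=(d_model, n_experts)) * 0.1
        # One d_model x d_model weight matrix per expert.
        self.experts = rng.normal(size=(n_experts, d_model, d_model)) * 0.1

    def forward(self, x, top_k=None):
        """x has shape (tokens, d_model).

        top_k=None: dense path, all experts are used for every token.
        top_k=k:    sparse path, only the k largest gates per token are kept.
        """
        gates = softmax(x @ self.router)  # (tokens, n_experts)
        if top_k is not None:
            # Indices of the top_k largest gates for each token.
            idx = np.argsort(gates, axis=-1)[:, -top_k:]
            mask = np.zeros_like(gates)
            np.put_along_axis(mask, idx, 1.0, axis=-1)
            # Assumption: kept gates are not renormalized.
            gates = gates * mask

        out = np.zeros_like(x)
        for e in range(self.n_experts):
            active = gates[:, e] > 0  # tokens routed to expert e
            if active.any():
                # Only tokens routed to this expert pay for its matmul.
                out[active] += gates[active, e:e + 1] * (x[active] @ self.experts[e])
        return out, gates


layer = ToyMoELayer(d_model=8, n_experts=4)
x = np.random.default_rng(1).normal(size=(5, 8))

dense_out, dense_gates = layer.forward(x)             # training-style pass
sparse_out, sparse_gates = layer.forward(x, top_k=2)  # inference-style pass
print((dense_gates > 0).sum(axis=-1))   # experts used per token, dense path
print((sparse_gates > 0).sum(axis=-1))  # experts used per token, sparse path
```

In this sketch the sparse path skips the matrix multiplications of unselected experts, which is where the inference savings would come from; how the paper trains the router so that the sparse path keeps the dense path's quality is not covered here.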

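By training with dense computation and running inference with sparse computation, the hybrid dense-training, sparse-inference MoE framework (DS-MoE) achieves strong computation and parameter efficiency while preserving performance: it is more parameter-efficient than standard sparse MoEs and on par with dense models in total parameter size and performance, at a lower computational cost.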

Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models

Mixture of Experts layers (MoEs) enable efficient scaling of language models
through conditional computation. This paper presents a detailed empirical study
of how autoregressive MoE language models scale in comparison with dense models
in a wide range of settings: in- and out-of-domain language modeling, zero- and
few-shot priming, and full-shot fine-tuning. With the exception of fine-tuning,
we find MoEs to be substantially more compute efficient. At more modest
training budgets, MoEs can match the performance of dense models using $\sim$4
times less compute. This gap narrows at scale, but our largest MoE model (1.1T
parameters) consistently outperforms a compute-equivalent dense model (6.7B
parameters). Overall, this performance gap varies greatly across tasks and
domains, suggesting that MoE and dense models generalize differently in ways
that are worthy of future study. We make our code and models publicly available
for research use.

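This paper studies how autoregressive MoE language models scale in comparison with dense models across a range of settings and finds that, except for fine-tuning, MoE models are more compute-efficient than dense models at the same budget. The results suggest that MoE and dense models generalize differently across tasks and domains, which merits further study.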