Mixture of Experts (MoE) models have emerged as a primary solution for
reducing the computational cost of Large Language Models. In this work, we
analyze their scaling properties, incorporating an expanded range of variables.
Specifically, we introduce a new hyperparameter, granularity, whose adjustment
enables precise control over the size of the experts. Building on this, we
establish scaling laws for fine-grained MoE, taking into account the number of
training tokens, model size, and granularity. Leveraging these laws, we derive
the optimal training configuration for a given computational budget. Our
findings not only show that MoE models consistently outperform dense
Transformers but also highlight that the efficiency gap between dense and MoE
models widens as we scale up the model size and training budget. Furthermore,
we demonstrate that the common practice of setting the size of experts in MoE
to mirror the feed-forward layer is suboptimal at almost every computational
budget.

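By analyzing an expanded range of variables, we establish scaling laws for fine-grained Mixture of Experts models and use them to derive the optimal training configuration for a given computational budget. The results show that Mixture of Experts models consistently outperform dense Transformers as model size and training budget grow. Furthermore, we show that the common practice of setting the expert size to match the feed-forward layer is suboptimal at almost every computational budget.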
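
To make the role of granularity concrete, the following is a minimal sketch of how such a hyperparameter could control expert size. The abstract does not spell out the definition, so the sketch assumes that granularity G is the ratio of the dense feed-forward hidden size to the expert hidden size, that the number of experts grows as G times an expansion factor, and that each token is routed to G experts. The names `MoELayerShape`, `fine_grained_shape`, and `ffn_params` are hypothetical, and router parameters and biases are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoELayerShape:
    expert_hidden_size: int  # hidden width of each expert
    num_experts: int         # total number of experts in the layer
    experts_per_token: int   # experts activated for each token


def fine_grained_shape(d_ff: int, granularity: int, expansion: int) -> MoELayerShape:
    """Derive an MoE layer shape from a dense feed-forward width.

    Assumption (not stated in the abstract): granularity = d_ff / expert width,
    with the expert count and the per-token routing scaled by the same factor.
    """
    if granularity < 1 or expansion < 1:
        raise ValueError("granularity and expansion must be positive")
    if d_ff % granularity != 0:
        raise ValueError("d_ff must be divisible by granularity")
    return MoELayerShape(
        expert_hidden_size=d_ff // granularity,
        num_experts=granularity * expansion,
        experts_per_token=granularity,
    )


def ffn_params(d_model: int, hidden: int) -> int:
    """Weights in a two-matrix feed-forward block (biases ignored)."""
    return 2 * d_model * hidden


if __name__ == "__main__":
    d_model, d_ff, expansion = 1024, 4096, 8
    for g in (1, 2, 4, 8):
        shape = fine_grained_shape(d_ff, g, expansion)
        per_expert = ffn_params(d_model, shape.expert_hidden_size)
        total = shape.num_experts * per_expert
        active = shape.experts_per_token * per_expert
        print(g, shape, total, active)
```

Under these assumptions, raising the granularity makes each expert smaller and more numerous while leaving both the total and the per-token active parameter counts unchanged; granularity 1 corresponds to the common practice of experts that mirror the feed-forward layer.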