Scaling the capacity of language models has consistently proven to be a reliable approach for improving performance and unlocking new capabilities. Capacity is defined primarily by two dimensions: the number of model parameters and the compute per example. While scaling typically involves increasing both, the precise interplay between these factors and their combined contribution to overall capacity is not yet fully understood. We explore this relationship in the context of sparse Mixture-of-Experts (MoEs), which allow scaling the number of parameters without proportionally increasing the FLOPs per example. We investigate how varying the sparsity level, i.e., the fraction of inactive parameters, affects model performance during pretraining and downstream few-shot evaluation. We find that under different constraints (e.g., parameter size and total training compute), there is an optimal level of sparsity that improves both training efficiency and model performance. These results provide a better understanding of the impact of sparsity in scaling laws for MoEs and complement existing work in this area, offering insights for designing more efficient architectures.
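
To make the definition of sparsity concrete, the following minimal sketch computes total parameters, active parameters, and the fraction of inactive parameters for a single MoE layer. The class `MoEConfig`, its field names, and all numeric values are hypothetical and chosen only for illustration; they are not taken from the paper, and a single layer with a fixed pool of shared parameters is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEConfig:
    """Hypothetical single-layer MoE description (illustrative only)."""
    num_experts: int        # experts available in the layer
    active_experts: int     # experts each token is routed to (top-k)
    params_per_expert: int  # parameters in one expert
    shared_params: int      # parameters every token uses (assumed: attention, embeddings)

    def total_params(self) -> int:
        # All parameters stored in the model, active or not.
        return self.shared_params + self.num_experts * self.params_per_expert

    def active_params(self) -> int:
        # Parameters actually used for one token; per-example compute scales with this.
        return self.shared_params + self.active_experts * self.params_per_expert

    def sparsity(self) -> float:
        # Fraction of parameters that are inactive for a given token.
        return 1.0 - self.active_params() / self.total_params()


# A dense baseline: one expert, always active.
dense = MoEConfig(num_experts=1, active_experts=1,
                  params_per_expert=1_000_000, shared_params=1_000_000)

# A sparse variant: eight experts, one active per token.
sparse = MoEConfig(num_experts=8, active_experts=1,
                   params_per_expert=1_000_000, shared_params=1_000_000)

if __name__ == "__main__":
    for name, cfg in [("dense", dense), ("sparse", sparse)]:
        print(f"{name}: total={cfg.total_params():,} "
              f"active={cfg.active_params():,} sparsity={cfg.sparsity():.3f}")
```

In this toy setting the sparse variant stores 4.5 times as many parameters as the dense baseline while activating the same number per token, which is the sense in which MoEs decouple parameter count from FLOPs per example.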

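This work addresses the still poorly understood interplay between parameter count and per-example compute when scaling language model capacity. By examining how sparsity in sparse Mixture-of-Experts models affects performance during pretraining and downstream few-shot evaluation, it finds that there is an optimal sparsity level that improves both training efficiency and model performance under different constraints. These findings deepen the understanding of scaling laws for Mixture-of-Experts models and offer new insights for designing more efficient model architectures.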

Parameters vs. FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models