By increasing model parameters but activating them sparsely when performing a task, the Mixture-of-Experts (MoE) architecture significantly improves the performance of Large Language Models (LLMs) without increasing the inference cost. However, the memory consumption caused by the growing number of experts presents a challenge to deploying these models in many real-world settings. Our empirical study reveals that some experts encode redundant knowledge during pre-training. We thus propose a method of grouping and pruning similar experts to improve the model's parameter efficiency. We validate the effectiveness of our method by pruning two state-of-the-art MoE models, Mixtral-8x7B and Mixtral-8x22B. Evaluation shows that our method outperforms other model pruning methods on a range of natural language tasks. To facilitate future research, we will release our code and the pruned MoE models.
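
The idea of grouping and pruning similar experts can be illustrated with a minimal sketch. The abstract does not state the similarity measure, the grouping procedure, or how each group is reduced, so everything below is an assumption made for illustration: experts are represented as flattened weight vectors, similarity is cosine similarity, grouping is greedy against each group's first member, and pruning keeps only that first member. The names `cosine_similarity`, `group_similar_experts`, `prune_experts`, and the `threshold` parameter are hypothetical and are not taken from the authors' code.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two flattened weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def group_similar_experts(
    experts: List[Sequence[float]], threshold: float
) -> List[List[int]]:
    """Greedy grouping (an assumed procedure, not the paper's).

    Each expert joins the first existing group whose representative
    (the group's first member) has similarity >= threshold; otherwise
    it starts a new group.
    """
    groups: List[List[int]] = []
    for idx, weights in enumerate(experts):
        for group in groups:
            if cosine_similarity(experts[group[0]], weights) >= threshold:
                group.append(idx)
                break
        else:
            groups.append([idx])
    return groups


def prune_experts(
    experts: List[Sequence[float]], threshold: float = 0.9
) -> List[int]:
    """Return the indices of the experts kept: one per group."""
    return [group[0] for group in group_similar_experts(experts, threshold)]


# Toy example: expert 1 is nearly parallel to expert 0, so the two are grouped.
toy_experts = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
groups = group_similar_experts(toy_experts, threshold=0.9)
kept = prune_experts(toy_experts, threshold=0.9)
print(groups)  # [[0, 1], [2], [3]]
print(kept)    # [0, 2, 3]
```

In this toy setting four experts are reduced to three. A real MoE layer would also need its router adjusted so that tokens previously sent to a pruned expert are redirected, which this sketch leaves out.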

Diversifying the Expert Knowledge for Task-Agnostic Pruning in Sparse Mixture-of-Experts