Scaling large language models has revolutionized performance across
diverse domains, yet the continual growth in model size poses significant
challenges for real-world deployment. The Mixture of Experts (MoE) approach
addresses this by dynamically selecting and activating only a subset of
experts, significantly reducing computational costs while maintaining high
performance. However, MoE introduces potential redundancy (e.g., parameters)
and extra costs (e.g., communication overhead). Despite numerous compression
techniques developed for mitigating the redundancy in dense models, the
compression of MoE remains under-explored. We first bridge this gap with a
unified framework that not only integrates mainstream compression methods
but also supports a systematic understanding of MoE compression.
This framework approaches compression from two perspectives: Expert Slimming,
which compresses individual experts, and Expert Trimming, which removes
structured modules. Within this framework, we explore the optimization space
unexplored by existing methods, and further introduce aggressive Expert Trimming
techniques, i.e., Layer Drop and Block Drop, to eliminate redundancy at larger
scales. Based on these insights, we present a comprehensive recipe to guide
practitioners in compressing MoE effectively. Extensive experimental results
demonstrate the effectiveness of the compression methods under our framework
and of the proposed recipe, which achieves a 6.05x speedup and requires only
20.0 GB of memory while maintaining over 92% of performance on Mixtral-8x7B.
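
The sparse activation described above can be illustrated with a minimal sketch of top-k expert routing. This is a toy example, not the framework's implementation: the names `top_k_route` and `moe_forward`, the scalar "experts", and the router logits are all hypothetical, and the 8-expert, top-2 setting is chosen only to mirror Mixtral-8x7B's configuration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))


def moe_forward(x, experts, router_logits, k=2):
    """Evaluate only the k routed experts and combine their outputs."""
    out = 0.0
    experts_run = 0
    for idx, weight in top_k_route(router_logits, k):
        out += weight * experts[idx](x)
        experts_run += 1
    return out, experts_run


# Hypothetical toy setup: 8 scalar "experts", expert i multiplies by i + 1.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
router_logits = [0.1, 2.0, -1.0, 0.5, 2.0, 0.0, -0.3, 1.0]

output, experts_run = moe_forward(3.0, experts, router_logits, k=2)
print(output, experts_run)
```

Only 2 of the 8 experts are evaluated for this input, which is where the compute saving comes from. All 8 must still be stored, which is the parameter redundancy that compression targets.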

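Scaling large language models has brought revolutionary performance gains across diverse domains, but the continual growth in model size poses significant challenges for real-world deployment. The Mixture of Experts (MoE) approach dynamically selects and activates only a subset of experts, significantly reducing computational cost while maintaining high performance. We propose a novel unified framework for compressing MoE that not only seamlessly integrates mainstream compression methods but also helps systematically understand MoE compression. Within this framework, we approach compression from two perspectives: Expert Slimming, which compresses individual experts, and Expert Trimming, which removes structured modules. Building on this, we introduce several aggressive Expert Trimming techniques and present a comprehensive recipe to guide practitioners in compressing MoE effectively. Extensive experimental results validate the effectiveness of the compression methods under our framework and of the recipe, achieving a 6.05x speedup and only 20.0 GB of memory usage while maintaining over 92% of performance on Mixtral-8x7B.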