The advancement of deep learning has led to the emergence of
Mixture-of-Experts (MoE) models, known for their dynamic allocation of
computational resources based on input. Despite their promise, MoEs face
challenges, particularly in terms of memory requirements. To address this, our
work introduces SEER-MoE, a novel two-stage framework for reducing both the
memory footprint and compute requirements of pre-trained MoE models. The first
stage prunes the total number of experts, guided by heavy-hitters counting,
while the second stage employs a regularization-based
fine-tuning strategy to recover the lost accuracy and reduce the number of
experts activated during inference. Our empirical studies demonstrate the
effectiveness of our method, which yields a sparse MoE model optimized for
inference efficiency with minimal accuracy trade-offs.
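
To make the first stage concrete, the following is a minimal sketch of what heavy-hitters counting could look like for a single MoE layer. The abstract does not specify the exact procedure, so the details here are assumptions: router scores are collected on a small calibration set, each token's top-k experts receive one count, and the most frequently selected experts are kept. The function names and the toy data are hypothetical and are not taken from the SEER-MoE implementation.

```python
from collections import Counter


def count_expert_activations(router_logits, top_k):
    """Count how often each expert falls in a token's top-k router scores.

    router_logits: list of per-token lists, one score per expert (assumed layout).
    """
    counts = Counter()
    for logits in router_logits:
        # Rank expert indices by score, highest first; ties keep the lower index.
        ranked = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)
        counts.update(ranked[:top_k])
    return counts


def select_experts_to_keep(router_logits, top_k, num_keep):
    """Return the indices of the num_keep most frequently selected experts."""
    num_experts = len(router_logits[0])
    counts = count_expert_activations(router_logits, top_k)
    # Sort by descending count, breaking ties by expert index for determinism.
    ranked = sorted(range(num_experts), key=lambda e: (-counts[e], e))
    return sorted(ranked[:num_keep])


# Toy calibration data (hypothetical): 4 tokens routed over 4 experts.
calibration_logits = [
    [2.0, 0.1, 1.5, -1.0],
    [1.8, 0.3, 0.2, -0.5],
    [0.0, 0.2, 2.2, 1.9],
    [2.5, -0.3, 1.1, 0.4],
]

counts = count_expert_activations(calibration_logits, top_k=2)
kept = select_experts_to_keep(calibration_logits, top_k=2, num_keep=2)
print(dict(counts), kept)
```

In this sketch the pruned experts would simply be dropped from the layer and the router restricted to the kept indices. A variant that accumulates softmax routing probabilities instead of hard top-k counts is equally plausible under the same description.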
