Mixture-of-Experts (MoE) models have recently gained traction by achieving
state-of-the-art performance on a wide range of tasks in computer vision and
natural language processing. They effectively expand model capacity while
incurring only a minimal increase in computation cost during training. However,
deploying such models for inference is difficult due to their large model size
and complex communication patterns. In this work, we characterize
two MoE workloads, namely Language Modeling (LM) and Machine Translation
(MT), and identify their sources of inefficiency at deployment.
We propose three optimization techniques to mitigate these
inefficiencies: (1) Dynamic Gating, (2) Expert Buffering, and (3) Expert
Load Balancing. We show that dynamic gating improves execution time by
1.25-4$\times$ for LM, 2-5$\times$ for MT Encoder and 1.09-1.5$\times$ for MT
Decoder. It also reduces memory usage by up to 1.36$\times$ for LM and up to
1.1$\times$ for MT. We further propose Expert Buffering, a new caching
mechanism that only keeps hot, active experts in GPU memory while buffering the
rest in CPU memory. This reduces static memory allocation by 1.47$\times$.
Finally, we propose a load balancing methodology that makes deployment more
robust to variation in the workload. The code will be open-sourced upon acceptance.

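This paper proposes three optimization techniques for Mixture-of-Experts (MoE) models: dynamic gating, expert buffering, and expert load balancing. Dynamic gating improves performance by up to 5$\times$ while also reducing GPU memory usage, and expert buffering reduces static memory allocation by up to 1.47$\times$ by keeping only hot experts in GPU memory. Together, these techniques make MoE models more efficient and easier to deploy in real-world applications.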
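
The idea behind Expert Buffering (keep only hot experts in GPU memory and leave the rest in CPU memory) can be illustrated with a small cache sketch. This is not the authors' implementation: the class name `ExpertBuffer`, the parameters `num_experts` and `gpu_slots`, and the least-recently-used (LRU) eviction policy are all assumptions made for illustration, since the abstract does not say which eviction policy is used. Plain Python objects stand in for the two memory tiers, so no GPU is required to run it.

```python
from collections import OrderedDict


class ExpertBuffer:
    """Toy model of a fixed-size GPU expert cache backed by CPU memory.

    Assumption: least-recently-used (LRU) eviction. The abstract does not
    specify the eviction policy, so this is an illustrative choice only.
    """

    def __init__(self, num_experts, gpu_slots):
        if not 0 < gpu_slots <= num_experts:
            raise ValueError("gpu_slots must be in [1, num_experts]")
        # "CPU memory": every expert's weights live here permanently.
        self.cpu = {e: f"weights[{e}]" for e in range(num_experts)}
        # "GPU memory": holds at most gpu_slots experts, ordered by recency.
        self.gpu = OrderedDict()
        self.gpu_slots = gpu_slots
        self.hits = 0
        self.misses = 0

    def fetch(self, expert_id):
        """Return the weights of expert_id, loading them to the GPU if needed."""
        if expert_id in self.gpu:
            self.hits += 1
            self.gpu.move_to_end(expert_id)  # mark as most recently used
        else:
            self.misses += 1
            if len(self.gpu) >= self.gpu_slots:
                self.gpu.popitem(last=False)  # evict the least recently used
            # Stands in for a CPU-to-GPU weight copy.
            self.gpu[expert_id] = self.cpu[expert_id]
        return self.gpu[expert_id]


# Hypothetical routing trace: expert 0 is "hot", experts 1 and 2 are colder.
buffer = ExpertBuffer(num_experts=8, gpu_slots=2)
for expert_id in [0, 1, 0, 2, 0, 1]:
    buffer.fetch(expert_id)
```

In this sketch only `gpu_slots` experts are ever resident in the simulated GPU tier, which is the source of the static memory saving; the cost is a transfer on every miss, so the benefit depends on how skewed expert activation is in the workload.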