Mixture-of-experts (MoE) is gaining increasing attention due to its unique
properties and remarkable performance, especially for language tasks. By
sparsely activating a subset of parameters for each token, the MoE architecture
can increase model size without sacrificing computational efficiency,
achieving a better trade-off between performance and training costs. However,
the underlying mechanism of MoE remains underexplored, and its
degree of modularization remains questionable. In this paper, we make an initial
attempt to understand the inner workings of MoE-based large language models.
Concretely, we comprehensively study the parametric and behavioral features of
three recent MoE-based models and reveal some intriguing observations:
(1) neurons act like fine-grained experts; (2) the router of MoE
usually selects experts with larger output norms; (3) expert diversity
increases with layer depth, while the last layer is an outlier. Based on
these observations, we also provide suggestions for a broad spectrum of MoE
practitioners, such as on router design and expert allocation. We hope this work
could shed light on future research on the MoE framework and other modular
architectures. Code is available at
this https URL
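
The sparse activation described in this abstract can be illustrated with a minimal top-k routing sketch. It is not the code of any of the studied models: the function name `moe_forward`, the ReLU expert MLPs, the toy dimensions, and the choice to renormalise the gate weights with a softmax over the selected logits are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_forward(tokens, router_w, experts_w1, experts_w2, top_k=2):
    """Sparse MoE feed-forward: each token runs through only its top_k experts.

    tokens:      (n_tokens, d_model)
    router_w:    (d_model, n_experts)
    experts_w1:  (n_experts, d_model, d_hidden)
    experts_w2:  (n_experts, d_hidden, d_model)
    """
    logits = tokens @ router_w                           # (n_tokens, n_experts)
    top_idx = np.argsort(-logits, axis=-1)[:, :top_k]    # chosen experts per token
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)
    gates = softmax(top_logits, axis=-1)                 # assumed: renormalise over chosen experts

    out = np.zeros_like(tokens)
    for t in range(tokens.shape[0]):
        for slot in range(top_k):
            e = top_idx[t, slot]
            hidden = np.maximum(tokens[t] @ experts_w1[e], 0.0)  # ReLU expert MLP (assumed)
            out[t] += gates[t, slot] * (hidden @ experts_w2[e])
    return out, top_idx

# Toy usage with hypothetical sizes: 6 experts exist, but each token uses only 2.
rng = np.random.default_rng(0)
n_tokens, d_model, d_hidden, n_experts = 4, 8, 16, 6
tokens = rng.normal(size=(n_tokens, d_model))
router_w = rng.normal(size=(d_model, n_experts))
w1 = rng.normal(size=(n_experts, d_model, d_hidden))
w2 = rng.normal(size=(n_experts, d_hidden, d_model))
out, chosen = moe_forward(tokens, router_w, w1, w2, top_k=2)
```

Adding experts in this sketch grows the parameter count, while the per-token compute depends only on `top_k`, which is the trade-off the abstract refers to.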

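A preliminary study of the internal mechanisms and behavioral features of Mixture-of-experts (MoE) shows that neurons act like fine-grained experts, yields several intriguing observations about the parametric and behavioral features of MoE models, and offers insights for future research on the MoE framework and other modular architectures.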

A Closer Look into Mixture-of-Experts in Large Language Models

The Mixture of Experts (MoE) is a widely known neural architecture where an
ensemble of specialized sub-models optimizes overall performance with a
constant computational cost. However, conventional MoEs pose challenges at
scale due to the need to store all experts in memory. In this paper, we push
MoE to the limit. We propose an extremely parameter-efficient MoE by uniquely
combining the MoE architecture with lightweight experts. Our MoE architecture
outperforms standard parameter-efficient fine-tuning (PEFT) methods and is on
par with full fine-tuning while updating only the lightweight experts, less than
1% of an 11B-parameter model. Furthermore, our method generalizes to unseen
tasks as it does not depend on any prior task knowledge. Our research
underscores the versatility of the mixture of experts architecture, showcasing
its ability to deliver robust performance even when subjected to rigorous
parameter constraints. Our code used in all the experiments is publicly
available here: this https URL
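
One way to picture "lightweight experts" on top of a frozen model is sketched below. The abstract does not say what form the experts take, so this sketch assumes, purely for illustration, that each expert is a low-rank (LoRA-style) update mixed by a dense softmax router; the class name `LowRankExpertMixture` and all sizes are hypothetical, and the trainable fraction of this toy layer is not meant to reproduce the reported figure.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class LowRankExpertMixture:
    """Frozen dense layer plus a soft mixture of small low-rank experts (assumed form)."""

    def __init__(self, d_in, d_out, n_experts, rank, rng):
        self.base_w = rng.normal(size=(d_in, d_out))                 # frozen pretrained weight
        self.router_w = np.zeros((d_in, n_experts))                  # trainable router
        self.down = rng.normal(size=(n_experts, d_in, rank)) * 0.01  # trainable
        self.up = np.zeros((n_experts, rank, d_out))                 # trainable, zero-initialised

    def forward(self, x):
        gates = softmax(x @ self.router_w)                  # (n_tokens, n_experts)
        # Low-rank update of every expert for every token: (n_experts, n_tokens, d_out)
        delta = np.einsum("nd,edr,erk->enk", x, self.down, self.up)
        mixed = np.einsum("ne,enk->nk", gates, delta)       # gate-weighted sum over experts
        return x @ self.base_w + mixed

    def trainable_fraction(self):
        trainable = self.router_w.size + self.down.size + self.up.size
        return trainable / (trainable + self.base_w.size)

rng = np.random.default_rng(0)
layer = LowRankExpertMixture(d_in=256, d_out=256, n_experts=4, rank=2, rng=rng)
x = rng.normal(size=(3, 256))
y = layer.forward(x)
frac = layer.trainable_fraction()
```

Because the `up` matrices start at zero, the layer initially reproduces the frozen model exactly, and only the router and the small expert matrices would receive gradient updates during fine-tuning.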

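Our research demonstrates the versatility of the mixture-of-experts architecture, which delivers robust performance even under strict parameter constraints; by uniquely combining the MoE architecture with lightweight experts, we propose an extremely parameter-efficient MoE architecture that pushes MoE to the limit.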

Pushing Mixture of Experts to the Limit: Extremely Parameter Efficient MoE for Instruction Tuning

The mixture-of-experts (MoE) architecture has proven to be a powerful method for
training deep models on diverse tasks across many applications. However, current
MoE implementations are task agnostic, treating all tokens from different tasks
in the same manner. In this work, we instead design a novel method that
incorporates task information into MoE models at different granular levels with
shared dynamic task-based adapters. Our experiments and analysis show the
advantages of our approaches over dense and canonical MoE models on
multi-task multilingual machine translation. With task-specific adapters, our
models can additionally generalize to new tasks efficiently.
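
The idea of routing MoE outputs through adapters chosen by task, at a selectable granularity, might look roughly like the sketch below. The abstract gives no implementation details, so everything here is an assumption: the bottleneck adapter form, the lazy creation of adapters for unseen tasks, the two granularity levels ("pair" and "target"), and the names `task_key` and `TaskAdapters`.

```python
import numpy as np

def task_key(src_lang, tgt_lang, granularity):
    """Map a translation direction to an adapter key (hypothetical granularity levels)."""
    if granularity == "pair":
        return (src_lang, tgt_lang)   # one adapter per language pair
    if granularity == "target":
        return tgt_lang               # one adapter shared by all pairs with this target
    raise ValueError(f"unknown granularity: {granularity}")

class TaskAdapters:
    """Residual bottleneck adapters shared by all tokens that map to the same task key."""

    def __init__(self, d_model, d_bottleneck, seed=0):
        self.d_model = d_model
        self.d_bottleneck = d_bottleneck
        self.rng = np.random.default_rng(seed)
        self.adapters = {}

    def get(self, key):
        # Create an adapter on first use, so a new task only adds a small module.
        if key not in self.adapters:
            down = self.rng.normal(size=(self.d_model, self.d_bottleneck)) * 0.02
            up = np.zeros((self.d_bottleneck, self.d_model))  # zero init: starts as identity
            self.adapters[key] = (down, up)
        return self.adapters[key]

    def forward(self, moe_output, key):
        down, up = self.get(key)
        return moe_output + np.maximum(moe_output @ down, 0.0) @ up

adapters = TaskAdapters(d_model=8, d_bottleneck=2)
hidden = np.ones((3, 8))  # stands in for the output of an MoE layer
y_en_de = adapters.forward(hidden, task_key("en", "de", "target"))
y_fr_de = adapters.forward(hidden, task_key("fr", "de", "target"))
y_en_fr = adapters.forward(hidden, task_key("en", "fr", "target"))
```

In this toy setup, en-de and fr-de share one adapter at the "target" granularity, while switching to "pair" would give each direction its own.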

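We design a novel method that incorporates task information into Mixture-of-experts models at different granular levels through shared dynamic task-based adapters. Experiments show that our method has advantages over dense and canonical Mixture-of-experts models on multi-task multilingual machine translation. With task-specific adapters, our models can also generalize efficiently to new tasks.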

Task-Based MoE for Multitask Multilingual Machine Translation

Pretraining on a large-scale corpus has become a standard method to build
general language models (LMs). Adapting a model to new data distributions
targeting different downstream tasks poses significant challenges. Naive
fine-tuning may incur catastrophic forgetting when the over-parameterized LMs
overfit the new data but fail to preserve the pretrained features. Lifelong
learning (LLL) aims to enable information systems to learn from a continuous
data stream across time. However, most prior work modifies the training recipe
while assuming a static, fixed network architecture. We find that additional model
capacity and proper regularization are key elements to achieving strong LLL
performance. Thus, we propose Lifelong-MoE, an extensible MoE
(Mixture-of-Experts) architecture that dynamically adds model capacity by
adding experts with regularized pretraining. Our results show that by only
introducing a limited number of extra experts while keeping the computation
cost constant, our model can steadily adapt to data distribution shifts while
preserving the previous knowledge. Compared to existing lifelong learning
approaches, Lifelong-MoE achieves better few-shot performance on 19 downstream
NLP tasks.
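
The expansion step described in this abstract can be sketched as follows. This is an illustrative toy rather than the Lifelong-MoE implementation: the class `ExpandableMoE`, the linear experts, the unweighted sum over selected experts, and the `frozen` flags (marking which experts a training loop should leave untouched) are assumptions, and the regularization term is omitted because the abstract does not specify it.

```python
import numpy as np

class ExpandableMoE:
    """Toy MoE layer whose expert pool can grow while per-token compute stays fixed."""

    def __init__(self, d_model, n_experts, rng, top_k=1):
        self.rng = rng
        self.top_k = top_k
        self.router_w = rng.normal(size=(d_model, n_experts))
        self.experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]
        self.frozen = [False] * n_experts

    def add_experts(self, n_new):
        """Freeze existing experts and append n_new trainable ones for a new data distribution."""
        d_model = self.router_w.shape[0]
        self.frozen = [True] * len(self.experts) + [False] * n_new
        new_cols = self.rng.normal(size=(d_model, n_new))            # router outputs for new experts
        self.router_w = np.concatenate([self.router_w, new_cols], axis=1)
        self.experts += [self.rng.normal(size=(d_model, d_model)) for _ in range(n_new)]

    def forward(self, x):
        logits = x @ self.router_w
        top_idx = np.argsort(-logits, axis=-1)[:, :self.top_k]       # still only top_k experts per token
        out = np.zeros_like(x)
        for t in range(x.shape[0]):
            for e in top_idx[t]:
                out[t] += x[t] @ self.experts[e]                     # unweighted sum (simplification)
        return out, top_idx

rng = np.random.default_rng(0)
moe = ExpandableMoE(d_model=8, n_experts=4, rng=rng, top_k=1)
old_experts = [w.copy() for w in moe.experts]
moe.add_experts(2)
x = rng.normal(size=(5, 8))
out, chosen = moe.forward(x)
```

The number of experts evaluated per token is set by `top_k` and does not change when experts are added, which mirrors the abstract's claim of constant computation cost.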

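This paper proposes Lifelong-MoE, a lifelong learning method built on an extensible MoE (Mixture-of-Experts) architecture; it achieves better few-shot performance, supports better pretraining on large-scale corpora, and adapts to different downstream tasks.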