Mixture-of-Experts (MoE) models have shown remarkable capability in
instruction tuning, especially when the number of tasks scales. However,
previous methods simply merge all training tasks (e.g. creative writing,
coding, and mathematics) and apply fixed sampling weights, without considering
the importance of different tasks as the model's training state changes. As a
result, the most helpful data cannot be effectively distinguished, leading to
suboptimal model performance. To reduce potential redundancy among datasets,
we make the first attempt and propose a novel dynamic data mixture for MoE
instruction tuning. Specifically, inspired by MoE's token routing preference,
we build dataset-level representations and then capture the subtle differences
among datasets. Finally, we propose to dynamically adjust the sampling weight
of datasets by their inter-redundancies, thus maximizing global performance
under a limited training budget. The experimental results on two MoE models
demonstrate the effectiveness of our approach on both downstream knowledge and
reasoning tasks and open-ended queries. Code and models are available at
this https URL.
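
The reweighting idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the dataset names, token counts, the L2 distance, and the log-space update with `step_size` are all assumptions chosen for illustration. Each dataset is represented by how its tokens are distributed over experts, and datasets whose routing pattern is far from the others (that is, less redundant) receive more sampling weight.

```python
import math

def gate_load(routing_counts):
    """Normalize per-expert token counts into a dataset-level routing distribution."""
    total = sum(routing_counts)
    return [c / total for c in routing_counts]

def l2_distance(a, b):
    """Euclidean distance between two routing distributions."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def update_sampling_weights(weights, loads, step_size=1.0):
    """Shift sampling mass toward datasets that differ most from the others.

    The update rule (log-weights plus scaled mean distance, then softmax)
    is an assumption for this sketch, not the rule from the paper.
    """
    n = len(loads)
    # Mean distance of each dataset to all others: a low value means more redundant.
    mean_dist = [
        sum(l2_distance(loads[i], loads[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]
    logits = [math.log(w) + step_size * d for w, d in zip(weights, mean_dist)]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical counts of tokens routed to 4 experts for three datasets.
counts = {
    "writing": [900, 50, 30, 20],
    "coding": [880, 60, 40, 20],   # routes almost like "writing": redundant
    "math": [100, 100, 400, 400],  # distinct routing pattern
}
loads = [gate_load(c) for c in counts.values()]
weights = update_sampling_weights([1 / 3] * 3, loads)
```

In this toy setting the "math" dataset ends up with the largest sampling weight, because its routing distribution is far from both of the others, while "writing" and "coding" share the remaining mass almost equally. In training, such an update would be repeated periodically as the router's preferences change.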

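Building on Mixture-of-Experts (MoE) models, this work proposes a dynamic data mixture method to improve model performance: it dynamically adjusts the sampling weights of the training data to reduce redundancy among datasets, thereby maximizing overall performance under a limited training budget.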

Dynamic Data Mixing Maximizes Instruction Tuning for Mixture-of-Experts

Mixture-of-experts (MoE) models facilitate efficient scaling; however,
training the router network introduces the challenge of optimizing a
non-differentiable, discrete objective. Recently, a fully-differentiable MoE
architecture, SMEAR, was proposed (Muqeeth et al., 2023), which softly merges
experts in the parameter space; nevertheless, its effectiveness was only
demonstrated in downstream fine-tuning on classification tasks. In this paper,
we present Lory, the first approach that scales such architectures to
autoregressive language model pre-training. Lory introduces two key techniques:
(1) a causal segment routing strategy that achieves high efficiency for expert
merging operations while preserving the autoregressive nature of language
models; (2) a similarity-based data batching method that encourages expert
specialization by grouping similar documents in training instances. We
pre-train a series of Lory models on 150B tokens from scratch, with up to 32
experts and 30B (1.5B active) parameters. Experimental results show significant
performance gains over parameter-matched dense models on both perplexity
(+13.9%) and a variety of downstream tasks (+1.5%-11.1%). Despite segment-level
routing, Lory models achieve competitive performance compared to
state-of-the-art MoE models with token-level routing. We further demonstrate
that the trained experts in Lory capture domain-level specialization without
supervision. Our work highlights the potential of fully-differentiable MoE
architectures for language model pre-training and advocates future research in
this area.
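
To make the two ideas of soft expert merging and causal segment routing concrete, here is a small sketch in plain Python. It is an illustration under stated assumptions, not Lory's implementation: experts are single weight matrices, the router is one linear map applied to the mean hidden state of the previous segment, and the first segment (which has no predecessor) falls back to uniform merging weights. The function names are hypothetical.

```python
import math

def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def mean_vector(tokens):
    """Average the token vectors of one segment."""
    dim = len(tokens[0])
    return [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]

def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def merge_experts(experts, probs):
    """Weighted average of expert weight matrices (soft merging in parameter space)."""
    rows, cols = len(experts[0]), len(experts[0][0])
    return [
        [sum(p * e[r][c] for p, e in zip(probs, experts)) for c in range(cols)]
        for r in range(rows)
    ]

def causal_segment_forward(tokens, experts, router, segment_len):
    """Route each segment using only the previous segment, then apply one merged expert."""
    outputs, all_probs = [], []
    prev_segment = None
    for start in range(0, len(tokens), segment_len):
        segment = tokens[start:start + segment_len]
        if prev_segment is None:
            # Assumption for this sketch: uniform weights for the first segment.
            probs = [1.0 / len(experts)] * len(experts)
        else:
            probs = softmax(matvec(router, mean_vector(prev_segment)))
        merged = merge_experts(experts, probs)  # one merge per segment, not per token
        outputs.extend(matvec(merged, t) for t in segment)
        all_probs.append(probs)
        prev_segment = segment
    return outputs, all_probs

# Toy setup: two 2x2 experts, a 2-expert router, four 2-d tokens, segments of 2.
experts = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]]]
router = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
outputs, segment_probs = causal_segment_forward(tokens, experts, router, segment_len=2)
```

Because the merging weights for a segment depend only on earlier tokens, no token's output depends on later tokens, which is what keeps the model autoregressive. Merging once per segment rather than once per token is also what keeps the cost of the merge operation low.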

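Lory is a fully differentiable Mixture-of-Experts architecture that introduces a causal segment routing strategy and a similarity-based data batching method, enabling efficient expert merging and expert specialization. The method yields significant gains in autoregressive language model pre-training, with improvements of +13.9% on perplexity and +1.5%-11.1% on a variety of downstream tasks, and it shows that Lory's experts capture domain-level specialization.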

Lory: Fully Differentiable Mixture-of-Experts for Autoregressive Language Model Pre-training

The advancement of deep learning has led to the emergence of
Mixture-of-Experts (MoE) models, known for their dynamic allocation of
computational resources based on input. Despite their promise, MoEs face
challenges, particularly in terms of memory requirements. To address this, our
work introduces SEER-MoE, a novel two-stage framework for reducing both the
memory footprint and compute requirements of pre-trained MoE models. The first
stage involves pruning the total number of experts using a heavy-hitters
counting guidance, while the second stage employs a regularization-based
fine-tuning strategy to recover accuracy loss and reduce the number of
activated experts during inference. Our empirical studies demonstrate the
effectiveness of our method, resulting in a sparse MoE model optimized for
inference efficiency with minimal accuracy trade-offs.
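
The first stage can be pictured with a short sketch of count-based expert pruning. This is a simplified reading of "heavy-hitters counting", not SEER-MoE's actual procedure: it assumes hard counting of top-k selections over a small calibration set, and the router scores, `k`, and `num_keep` below are hypothetical values.

```python
from collections import Counter

def top_k_experts(router_scores, k):
    """Indices of the k highest-scoring experts for one token."""
    order = sorted(range(len(router_scores)), key=lambda i: router_scores[i], reverse=True)
    return order[:k]

def heavy_hitter_counts(token_scores, k):
    """Count how often each expert appears in a token's top-k selection."""
    counts = Counter()
    for scores in token_scores:
        counts.update(top_k_experts(scores, k))
    return counts

def experts_to_keep(token_scores, k, num_keep):
    """Keep the most frequently activated experts; the rest are pruning candidates."""
    num_experts = len(token_scores[0])
    counts = heavy_hitter_counts(token_scores, k)
    ranked = sorted(range(num_experts), key=lambda i: counts[i], reverse=True)
    return sorted(ranked[:num_keep])

# Hypothetical router scores: 5 calibration tokens, 4 experts, top-2 routing.
calibration_scores = [
    [0.60, 0.30, 0.05, 0.05],
    [0.50, 0.10, 0.30, 0.10],
    [0.45, 0.35, 0.10, 0.10],
    [0.70, 0.10, 0.15, 0.05],
    [0.25, 0.50, 0.15, 0.10],
]
kept = experts_to_keep(calibration_scores, k=2, num_keep=2)
```

Here experts 0 and 1 are selected most often and are kept, while experts 2 and 3 would be removed. The second stage described above (regularization-based fine-tuning to recover accuracy and reduce the number of activated experts) is not covered by this sketch.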

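Our work introduces SEER-MoE, a novel two-stage framework for reducing the memory footprint and compute requirements of pre-trained MoE models. The first stage prunes the total number of experts using heavy-hitters counting guidance, while the second stage uses a regularization-based fine-tuning strategy to recover the lost accuracy and reduce the number of experts activated during inference. Our empirical studies demonstrate the effectiveness of the method, yielding a sparse MoE model optimized for inference efficiency with minimal accuracy trade-offs.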