Mixture-of-Experts (MoE) models have shown remarkable capability in
instruction tuning, especially when the number of tasks scales. However,
previous methods simply merge all training tasks (e.g. creative writing,
coding, and mathematics) and apply fixed sampling weights, without considering
the importance of different tasks as the model training state changes. In this
way, the most helpful data cannot be effectively distinguished, leading to
suboptimal model performance. To reduce the potential redundancies of datasets,
we make the first attempt and propose a novel dynamic data mixture for MoE
instruction tuning. Specifically, inspired by MoE's token routing preference,
we build dataset-level representations and then capture the subtle differences
among datasets. Finally, we propose to dynamically adjust the sampling weight
of datasets by their inter-redundancies, thus maximizing global performance
under a limited training budget. The experimental results on two MoE models
demonstrate the effectiveness of our approach on both downstream knowledge &
reasoning tasks and open-ended queries. Code and models are available at
this https URL .
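
As a rough illustration of the idea, the sketch below re-weights datasets by how distinct their routing profiles are. It is not the paper's update rule: the representation (the average share of tokens each expert receives per dataset), the L2 distance, the multiplicative update, and the names `update_sampling_weights` and `lr` are all assumptions made for this example. It needs at least two datasets.

```python
import math

def pairwise_l2(reps):
    """Return the matrix of L2 distances between dataset-level vectors."""
    n = len(reps)
    return [[math.dist(reps[i], reps[j]) for j in range(n)] for i in range(n)]

def update_sampling_weights(weights, reps, lr=1.0):
    """Shift sampling mass toward datasets whose routing profile is distinct.

    weights: current sampling probabilities, one per dataset.
    reps:    hypothetical dataset-level vectors, e.g. the average share of
             tokens each expert receives for that dataset.
    lr:      step size of the multiplicative update (an assumed knob).
    """
    dist = pairwise_l2(reps)
    n = len(reps)
    # Mean distance to the other datasets: a low value means more redundant.
    delta = [sum(row) / (n - 1) for row in dist]
    # Multiplicative update in log space, then renormalise to a distribution.
    logits = [math.log(w) + lr * d for w, d in zip(weights, delta)]
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Three datasets routed over four experts; the first two are near-duplicates.
reps = [
    [0.70, 0.10, 0.10, 0.10],
    [0.68, 0.12, 0.10, 0.10],
    [0.10, 0.10, 0.10, 0.70],
]
weights = update_sampling_weights([1 / 3, 1 / 3, 1 / 3], reps)
```

In this toy setup the third dataset ends up with the largest sampling weight, because its routing profile is far from the other two, while the two near-duplicates share a smaller portion of the budget.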

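Building on Mixture-of-Experts models, this work proposes a dynamic data mixture method to improve model performance: it dynamically adjusts the sampling weights of the training data to reduce redundancy among datasets, thereby maximizing overall performance under a limited training budget.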

Dynamic Data Mixing Maximizes Instruction Tuning for Mixture-of-Experts

Large Mixture of Experts (MoE) models can achieve state-of-the-art quality
on various language tasks, including machine translation, thanks to their
efficient model scaling with expert parallelism. However, this scaling
brings a fundamental issue of larger memory consumption and a more severe
memory bandwidth bottleneck at deployment time. In this paper, we propose
Mixture of Quantized Experts (MoQE), a simple weight-only quantization method
that applies ultra low-bit quantization, down to 2 bits, only to expert
weights to mitigate the increased memory and latency of MoE models. We show
that low-bit quantization together with the MoE architecture delivers reliable
model performance while reducing the memory size significantly, even without
any additional training in most cases. In particular, expert layers in MoE
models are much more robust to quantization than conventional feedforward
network (FFN) layers. In our comprehensive analysis, we show that MoE models
with 2-bit expert weights can deliver better model performance than the dense
model trained on the same dataset. As a result of low-bit quantization, we
show the model size can be reduced by 79.6% relative to the original
half-precision floating point (fp16) MoE model. Combined with an optimized
GPU runtime implementation, it also achieves a 1.24X speed-up on A100 GPUs.
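
To make "weight-only 2-bit quantization" concrete, here is a minimal sketch of one common scheme: per-row asymmetric min-max quantization with four codes packed into each byte. The abstract does not specify MoQE's quantizer, so the scheme and the helper names (`quantize_row`, `dequantize_row`, `pack_2bit`) are assumptions for illustration only.

```python
def quantize_row(row, bits=2):
    """Asymmetric min-max quantization of one weight row to 2**bits levels."""
    levels = (1 << bits) - 1            # 3 steps for the 2-bit codes 0..3
    lo, hi = min(row), max(row)
    scale = (hi - lo) / levels or 1.0   # fall back to 1.0 for a constant row
    codes = [round((w - lo) / scale) for w in row]
    return codes, scale, lo

def dequantize_row(codes, scale, lo):
    """Map integer codes back to approximate floating-point weights."""
    return [lo + c * scale for c in codes]

def pack_2bit(codes):
    """Pack four 2-bit codes into each byte, lowest bits first."""
    out = bytearray()
    for i in range(0, len(codes), 4):
        byte = 0
        for shift, code in enumerate(codes[i:i + 4]):
            byte |= (code & 0b11) << (2 * shift)
        out.append(byte)
    return bytes(out)

row = [-0.8, -0.1, 0.05, 0.4, 1.0]      # one row of a toy expert weight matrix
codes, scale, zero = quantize_row(row)
restored = dequantize_row(codes, scale, zero)
packed = pack_2bit(codes)
```

With round-to-nearest, each restored weight lies within half a quantization step (`scale / 2`) of the original. Storing 2 bits instead of 16 shrinks the quantized tensors themselves by 87.5%, before counting the per-row scale and offset.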

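This work proposes Mixture of Quantized Experts (MoQE), a simple weight-only quantization method that reduces the memory consumption and latency of Mixture of Experts (MoE) models while maintaining reliable model performance and significantly shrinking the model size, in most cases without additional training.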

Mixture of Quantized Experts (MoQE): Complementary Effect of Low-bit Quantization and Robustness

Mixture of experts (MoE) is a popular technique in deep learning that
improves model capacity with conditionally-activated parallel neural network
modules (experts). However, serving MoE models in resource-constrained
latency-critical edge scenarios is challenging due to the significantly
increased model size and complexity. In this paper, we first analyze the
behavior pattern of MoE models in continuous inference scenarios, which leads
to three key observations about expert activations: temporal locality,
exchangeability, and skippable computation. Based on these
observations, we introduce PC-MoE, an inference framework for
resource-constrained continuous MoE model serving. The core of PC-MoE is a new
data structure, Parameter Committee, that intelligently maintains a subset of
important experts in use to reduce resource consumption. The optimal
configuration of Parameter Committee is found offline by a profiling-guided
committee planner, and expert swapping and request handling at runtime are
managed by an adaptive committee scheduler. To evaluate the effectiveness of
PC-MoE, we conduct experiments using state-of-the-art MoE models on common
computer vision and natural language processing tasks. The results demonstrate
optimal trade-offs between resource consumption and model accuracy achieved by
PC-MoE. For instance, on object detection tasks with the Swin-MoE model, our
approach can reduce memory usage and latency by 42.34% and 18.63%,
respectively, with only 0.10% accuracy degradation.
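
The temporal-locality observation suggests keeping only a small set of experts resident and swapping others in on demand. The sketch below shows that idea with a plain least-recently-used policy. This is a stand-in, not PC-MoE's Parameter Committee, whose planner and scheduler the abstract does not detail. `ExpertCache` and `load_expert` are hypothetical names.

```python
from collections import OrderedDict

class ExpertCache:
    """Keep at most `capacity` experts resident; evict the least recently used."""

    def __init__(self, capacity, load_expert):
        self.capacity = capacity
        self.load_expert = load_expert   # hypothetical loader: expert id -> weights
        self.resident = OrderedDict()    # expert id -> weights, coldest first
        self.hits = 0
        self.misses = 0

    def get(self, expert_id):
        if expert_id in self.resident:
            self.hits += 1
            self.resident.move_to_end(expert_id)   # mark as most recently used
            return self.resident[expert_id]
        self.misses += 1
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)      # swap out the coldest expert
        self.resident[expert_id] = self.load_expert(expert_id)
        return self.resident[expert_id]

# Consecutive requests tend to reuse the same experts (temporal locality).
cache = ExpertCache(capacity=2, load_expert=lambda i: f"weights-{i}")
for expert_id in [0, 1, 0, 0, 2, 0, 1]:
    cache.get(expert_id)
```

On this trace the cache records 3 hits and 4 misses and ends with experts 0 and 1 resident. A profiling-guided policy could presumably do better than LRU by exploiting the exchangeability and skippable-computation observations, which this sketch ignores.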

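PC-MoE, an inference framework for serving MoE models under resource constraints in continuous inference scenarios, effectively reduces resource consumption with only minimal loss in model accuracy.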

Serving MoE Models on Resource-constrained Edge Devices via Dynamic Expert Swapping

Sparsely activated transformers, such as Mixture of Experts (MoE), have
received great interest due to their outrageous scaling capability, which
enables dramatic increases in model size without significant increases in
computational cost. To achieve this, MoE models replace the feedforward
sub-layer with a Mixture-of-Experts sub-layer in transformers and use a gating
network to route each token to its assigned experts. Since the common practice
for efficient training of such models requires distributing experts and tokens
across different machines, this routing strategy often incurs a huge
cross-machine communication cost because tokens and their assigned experts
likely reside on different machines. In this paper, we propose Gating
Dropout, which allows tokens to ignore the gating network and stay at their
local machines, thus reducing the cross-machine communication. Similar to
traditional dropout, we also show that Gating Dropout has a regularization
effect during training, resulting in improved generalization performance. We
validate the effectiveness of Gating Dropout on multilingual machine
translation tasks. Our results demonstrate that Gating Dropout improves a
state-of-the-art MoE model with faster wall-clock time convergence rates and
better BLEU scores for a variety of model sizes and datasets.
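
The routing change can be sketched in a few lines. The abstract does not state the granularity of the dropout decision, so this example assumes a single decision per training step shared by all tokens, and top-1 routing otherwise. `route_step`, `local_expert`, and `p_drop` are hypothetical names.

```python
import random

def route_step(gate_scores, local_expert, p_drop, rng):
    """Return the expert index each token is sent to for one training step.

    gate_scores:  per-token lists of gate scores, one score per expert.
    local_expert: per-token index of the expert hosted on the token's machine.
    p_drop:       probability of skipping the gate for this step (assumed to
                  be one decision per step, shared by all tokens).
    """
    if rng.random() < p_drop:
        # Gate is dropped: every token stays local, so no cross-machine traffic.
        return list(local_expert)
    # Otherwise use ordinary top-1 routing from the gate scores.
    return [max(range(len(s)), key=s.__getitem__) for s in gate_scores]

gate_scores = [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1], [0.2, 0.2, 0.6]]
local_expert = [0, 0, 1]
rng = random.Random(0)
always_local = route_step(gate_scores, local_expert, p_drop=1.0, rng=rng)
always_gated = route_step(gate_scores, local_expert, p_drop=0.0, rng=rng)
```

With `p_drop=1.0` every token goes to its local expert, and with `p_drop=0.0` routing follows the gate. Intermediate values trade communication savings against how often the gate is actually trained.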

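This work proposes Gating Dropout, a method that reduces the cross-machine communication cost of training MoE models, and validates its effectiveness on multilingual machine translation tasks.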