Mixture-of-experts (MoE) models facilitate efficient scaling; however,
training the router network introduces the challenge of optimizing a
non-differentiable, discrete objective. Recently, a fully differentiable MoE
architecture, SMEAR, was proposed (Muqeeth et al., 2023), which softly merges
experts in the parameter space; nevertheless, its effectiveness was only
demonstrated in downstream fine-tuning on classification tasks. In this paper,
we present Lory, the first approach that scales such architectures to
autoregressive language model pre-training. Lory introduces two key techniques:
(1) a causal segment routing strategy that achieves high efficiency for expert
merging operations while preserving the autoregressive nature of language
models; (2) a similarity-based data batching method that encourages expert
specialization by grouping similar documents in training instances. We
pre-train a series of Lory models on 150B tokens from scratch, with up to 32
experts and 30B (1.5B active) parameters. Experimental results show significant
performance gains over parameter-matched dense models on both perplexity
(+13.9%) and a variety of downstream tasks (+1.5%-11.1%). Despite segment-level
routing, Lory models achieve competitive performance compared to
state-of-the-art MoE models with token-level routing. We further demonstrate
that the trained experts in Lory capture domain-level specialization without
supervision. Our work highlights the potential of fully differentiable MoE
architectures for language model pre-training and advocates future research in
this area.
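
The two ideas named above, soft merging of experts in parameter space and causal segment routing, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The abstract does not say how routing weights are computed, so the sketch assumes that the weights for a segment come from the mean hidden state of the previous segment (which keeps routing causal) and that the first segment falls back to uniform weights. The names `merge_experts`, `causal_segment_forward`, and `router` are hypothetical, and each expert is reduced to a single weight matrix.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def merge_experts(experts, weights):
    """Soft merge: weighted average of expert matrices in parameter space."""
    rows, cols = len(experts[0]), len(experts[0][0])
    return [
        [sum(w * e[r][c] for w, e in zip(weights, experts)) for c in range(cols)]
        for r in range(rows)
    ]


def causal_segment_forward(hidden, experts, router, segment_len):
    """Apply one merged expert per segment of `hidden` (a list of token vectors).

    Assumption (not from the abstract): routing weights for a segment are
    computed from the previous segment only; the first segment uses uniform
    weights because it has no predecessor.
    """
    n_experts = len(experts)
    outputs, all_weights = [], []
    prev_segment = None
    for start in range(0, len(hidden), segment_len):
        segment = hidden[start:start + segment_len]
        if prev_segment is None:
            weights = [1.0 / n_experts] * n_experts
        else:
            weights = softmax(matvec(router, mean_vector(prev_segment)))
        merged = merge_experts(experts, weights)  # one merge per segment
        outputs.extend(matvec(merged, tok) for tok in segment)
        all_weights.append(weights)
        prev_segment = segment
    return outputs, all_weights


# Toy usage: 2 experts, 2-dimensional tokens, segments of 2 tokens.
experts = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]]]
router = [[1.0, 0.0], [0.0, 1.0]]  # one row of router weights per expert
hidden = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
outputs, weights = causal_segment_forward(hidden, experts, router, segment_len=2)
```

In this sketch the weights for a segment depend only on earlier tokens, so changing the tokens of the last segment leaves its routing weights unchanged. The merge also runs once per segment rather than once per token, which is where the efficiency of segment-level routing would come from.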

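Lory is a fully differentiable mixture-of-experts architecture. By introducing a causal segment routing strategy and a similarity-based data batching method, it achieves efficient expert merging and encourages expert specialization. In autoregressive language model pre-training, it delivers significant gains over parameter-matched dense models, improving perplexity by 13.9% and a range of downstream tasks by 1.5% to 11.1%, and its experts are shown to capture domain-level specialization.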

Lory: Fully Differentiable Mixture-of-Experts for Autoregressive Language Model Pre-training

In this study, we systematically evaluate the impact of common design choices
in Mixture of Experts (MoEs) on validation performance, uncovering distinct
influences at token and sequence levels. We also present empirical evidence
showing comparable performance between a learned router and a frozen, randomly
initialized router, suggesting that learned routing may not be essential. Our
study further reveals that sequence-level routing can result in weak,
topic-specific expert specialization, in contrast to the syntax specialization
observed with token-level routing.
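
To make the compared settings concrete, the following sketch contrasts token-level and sequence-level top-k routing using a frozen, randomly initialized router. It is an illustration under assumptions, not the study's code. The abstract does not describe the router's form, so a linear scoring layer is assumed, and sequence-level routing is assumed to score the mean of the token vectors. All function names are hypothetical.

```python
import random


def make_frozen_router(n_experts, dim, seed=0):
    """Randomly initialized router weights that are never updated."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_experts)]


def top_k_experts(router, vec, k):
    """Indices of the k experts with the highest linear score for `vec`."""
    scores = [sum(w * v for w, v in zip(row, vec)) for row in router]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def route_tokens(router, tokens, k):
    """Token-level routing: every token picks its own experts."""
    return [top_k_experts(router, tok, k) for tok in tokens]


def route_sequence(router, tokens, k):
    """Sequence-level routing: one choice for the whole sequence.

    Assumption: the sequence is summarized by the mean of its token vectors.
    """
    mean = [sum(col) / len(tokens) for col in zip(*tokens)]
    return top_k_experts(router, mean, k)


router = make_frozen_router(n_experts=4, dim=3, seed=0)
tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 1.0]]
per_token_choice = route_tokens(router, tokens, k=2)   # one expert list per token
sequence_choice = route_sequence(router, tokens, k=2)  # one expert list in total
```

A learned router would update its weight matrix during training. The frozen variant keeps the seeded random weights fixed, which is the comparison the abstract reports as performing similarly.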

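This study systematically evaluates the impact of common design choices in mixture-of-experts models on validation performance and finds distinct influences at the token and sequence levels. It also provides empirical evidence that a learned router and a frozen, randomly initialized router perform comparably, suggesting that learned routing may not be essential. The study further shows that sequence-level routing can result in weak, topic-specific expert specialization, in contrast to the syntax specialization observed with token-level routing.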

Towards an empirical understanding of MoE design choices

In the era of large language models, Mixture-of-Experts (MoE) is a promising
architecture for managing computational costs when scaling up model parameters.
However, conventional MoE architectures like GShard, which activate the top-$K$
out of $N$ experts, face challenges in ensuring expert specialization, i.e.,
that each expert acquires non-overlapping and focused knowledge. In response, we
propose the DeepSeekMoE architecture towards ultimate expert specialization. It
involves two principal strategies: (1) finely segmenting the experts into $mN$
ones and activating $mK$ from them, allowing for a more flexible combination of
activated experts; (2) isolating $K_s$ experts as shared ones, aiming at
capturing common knowledge and mitigating redundancy in routed experts.
Starting from a modest scale with 2B parameters, we demonstrate that
DeepSeekMoE 2B achieves comparable performance with GShard 2.9B, which has 1.5
times the expert parameters and computation. In addition, DeepSeekMoE 2B nearly
approaches the performance of its dense counterpart with the same number of
total parameters, which sets the upper bound for MoE models. Subsequently, we
scale up DeepSeekMoE to 16B parameters and show that it achieves comparable
performance with LLaMA2 7B, with only about 40% of computations. Further, our
preliminary efforts to scale up DeepSeekMoE to 145B parameters consistently
validate its substantial advantages over the GShard architecture, and show its
performance comparable with DeepSeek 67B, using only 28.5% (possibly as little as 18.2%)
of computations.
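
The claim that finer segmentation allows "a more flexible combination of activated experts" can be checked with simple counting: choosing $K$ of $N$ experts gives $\binom{N}{K}$ combinations, while choosing $mK$ of $mN$ gives $\binom{mN}{mK}$. The sketch below computes both and adds a toy selection step with shared experts. The values $N=16$, $K=2$, $m=4$ are illustrative, the function names are hypothetical, and the assumption that shared experts count toward the $mK$ activated experts is not stated in the abstract.

```python
import math


def combination_counts(n_experts, top_k, m):
    """Expert combinations before and after splitting each expert into m parts."""
    coarse = math.comb(n_experts, top_k)
    fine = math.comb(m * n_experts, m * top_k)
    return coarse, fine


def select_experts(routed_scores, n_shared, n_active):
    """Pick the active experts for one token.

    Experts 0..n_shared-1 are shared and always active. The remaining slots
    are filled by the highest-scoring routed experts. Assumption: shared
    experts count toward the n_active total.
    """
    n_routed_active = n_active - n_shared
    ranked = sorted(
        range(len(routed_scores)), key=lambda i: routed_scores[i], reverse=True
    )
    shared = list(range(n_shared))
    routed = [n_shared + i for i in ranked[:n_routed_active]]
    return shared + routed


print(combination_counts(16, 2, 4))                  # (120, 4426165368)
print(select_experts([0.1, 0.9, 0.3, 0.7], 1, 3))    # [0, 2, 4]
```

With these illustrative numbers, the number of possible expert combinations grows from 120 to about 4.4 billion while the fraction of experts activated per token stays the same.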

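In the era of large language models, mixture-of-experts (MoE) is a promising architecture for managing computational costs when scaling up model parameters. However, conventional MoE architectures such as GShard face challenges in ensuring expert specialization. The authors therefore propose the DeepSeekMoE architecture, which aims at ultimate expert specialization.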