We investigate efficient methods for training Large Language Models (LLMs) to
possess capabilities in multiple specialized domains, such as coding, math
reasoning and world knowledge. Our method, named Branch-Train-MiX (BTX), starts
from a seed model, which is branched to train experts in an embarrassingly
parallel fashion with high throughput and reduced communication cost. After
individual experts are asynchronously trained, BTX brings together their
feedforward parameters as experts in Mixture-of-Experts (MoE) layers and
averages the remaining parameters, followed by an MoE finetuning stage to learn
token-level routing. BTX generalizes two special cases: the Branch-Train-Merge
method, which does not have the MoE finetuning stage to learn routing, and
sparse upcycling, which omits the stage of training experts asynchronously.
Compared to alternative approaches, BTX achieves the best accuracy-efficiency
tradeoff.

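We study efficient methods for training large language models (LLMs) to be capable in multiple specialized domains, such as coding, mathematical reasoning, and world knowledge. Our method, Branch-Train-MiX (BTX), starts from a seed model and branches it to train experts in an embarrassingly parallel fashion, with high throughput and reduced communication cost. Once the experts have been trained asynchronously, BTX combines their feedforward parameters as the experts of Mixture-of-Experts (MoE) layers and averages the remaining parameters, then runs an MoE finetuning stage to learn token-level routing. BTX generalizes two special cases: Branch-Train-Merge, which has no MoE finetuning stage to learn routing, and sparse upcycling, which omits asynchronous expert training. Compared with alternative approaches, BTX achieves the best accuracy-efficiency tradeoff.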
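
The merge step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the parameter names (`attn.w`, `ffn.w`), the `ffn_marker` convention, the use of plain Python lists in place of tensors, and top-k routing over given router logits are all assumptions made for illustration. In BTX the router is learned during the MoE finetuning stage; here the logits are simply supplied.

```python
import math


def merge_experts(expert_params, ffn_marker="ffn"):
    """Combine per-domain expert checkpoints into one MoE-style parameter set.

    expert_params: list of dicts mapping parameter name -> list of floats.
    Parameters whose name contains `ffn_marker` (a hypothetical naming
    convention) are kept separately, one copy per expert; all other
    parameters are averaged element-wise.
    """
    n = len(expert_params)
    merged = {}
    for name in expert_params[0]:
        tensors = [p[name] for p in expert_params]
        if ffn_marker in name:
            # Feedforward weights: keep every expert's copy side by side.
            merged[name] = [list(t) for t in tensors]
        else:
            # Remaining weights (e.g. attention): element-wise average.
            merged[name] = [sum(vals) / n for vals in zip(*tensors)]
    return merged


def route_top_k(router_logits, k=2):
    """Pick the k highest-scoring experts for one token and normalize their weights."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    # Softmax restricted to the selected experts (assumed gating scheme).
    m = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


# Two toy "experts" branched from the same seed model.
code_expert = {"attn.w": [1.0, 2.0], "ffn.w": [0.1, 0.2]}
math_expert = {"attn.w": [3.0, 4.0], "ffn.w": [0.3, 0.4]}

merged = merge_experts([code_expert, math_expert])
gates = route_top_k([0.5, 2.0, -1.0], k=2)
```

Here `merged["attn.w"]` holds the averaged attention weights, `merged["ffn.w"]` holds one feedforward copy per expert, and `gates` maps the selected expert indices to mixing weights that sum to one.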


Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM

Neural language models (NLMs) exist in an accuracy-efficiency tradeoff space
where better perplexity typically comes at the cost of greater computational
complexity. In a software keyboard application on mobile devices, this
translates into higher power consumption and shorter battery life. This paper
represents the first attempt, to our knowledge, to explore
accuracy-efficiency tradeoffs for NLMs. Building on quasi-recurrent neural
networks (QRNNs), we apply pruning techniques to provide a "knob" to select
different operating points. In addition, we propose a simple technique to
recover some perplexity using a negligible amount of memory. Our empirical
evaluations consider both perplexity and energy consumption on a
Raspberry Pi, where we demonstrate which methods provide the best
perplexity-power consumption operating point. At one operating point, one of
the techniques is able to provide energy savings of 40% over the state of the
art with only a 17% relative increase in perplexity.

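Building on quasi-recurrent neural networks (QRNNs), this paper applies pruning techniques to provide a "knob" for selecting different operating points, and proposes a simple technique that recovers some perplexity using a negligible amount of memory. Empirical evaluations on a Raspberry Pi consider both perplexity and energy consumption and show which methods provide the best perplexity-energy operating point; at one operating point, one technique saves 40% energy relative to the state of the art with only a 17% relative increase in perplexity.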
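
As a rough illustration of pruning as a "knob", the sketch below uses simple magnitude pruning on a flat list of weights. The abstract does not say which pruning techniques the paper applies, so magnitude pruning is a stand-in, and the perplexity and energy figures are hypothetical values chosen only to reproduce the reported 17% and 40% ratios.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights.

    `sparsity` acts as the knob: higher values remove more weights, trading
    accuracy for less computation.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    n_drop = int(len(weights) * sparsity)
    # Indices of the n_drop weights with the smallest absolute value.
    by_magnitude = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(by_magnitude[:n_drop])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def relative_change(new, baseline):
    """Relative change of `new` with respect to `baseline` (0.17 means +17%)."""
    return (new - baseline) / baseline


weights = [0.05, -0.8, 0.3, -0.02, 0.6, 0.1]
pruned = magnitude_prune(weights, sparsity=0.5)

# Hypothetical measurements, not taken from the paper.
baseline_ppl, pruned_ppl = 100.0, 117.0
baseline_joules, pruned_joules = 10.0, 6.0

ppl_increase = relative_change(pruned_ppl, baseline_ppl)
energy_saving = -relative_change(pruned_joules, baseline_joules)
```

Sweeping `sparsity` and recording `(ppl_increase, energy_saving)` for each setting would trace out the kind of operating-point curve the abstract refers to.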