Large Language Models (LLMs) have achieved remarkable results, but their
increasing resource demand has become a major obstacle to the development of
powerful and accessible super-human intelligence. This report introduces
JetMoE-8B, a new LLM trained with less than $0.1 million, using 1.25T tokens
from carefully mixed open-source corpora and 30,000 H100 GPU hours. Despite its
low cost, JetMoE-8B performs strongly: the base model outperforms Llama2-7B and
JetMoE-8B-Chat surpasses Llama2-13B-Chat. These results suggest that LLM
training can be much more
cost-effective than generally thought. JetMoE-8B is based on an efficient
Sparsely-gated Mixture-of-Experts (SMoE) architecture, composed of attention
and feedforward experts. Both layers are sparsely activated, allowing JetMoE-8B
to have 8B parameters while only activating 2B for each input token, reducing
inference computation by about 70% compared to Llama2-7B. Moreover, JetMoE-8B
is highly open and academia-friendly, using only public datasets and training
code. All training parameters and data mixtures have been detailed in this
report to facilitate future efforts in the development of open foundation
models. This transparency aims to encourage collaboration and further
advancements in the field of accessible and efficient LLMs. The model weights
are publicly available.
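
To make the sparse activation described above concrete, the following is a minimal sketch of top-k expert routing. It is not JetMoE-8B's actual implementation: the function names, the default k = 2, and the toy experts are assumptions for illustration, since the abstract does not say how many experts exist or how many are selected per token. The last helper checks the arithmetic behind the reported saving, under the simplifying assumption that inference computation scales with activated parameters: 2B active parameters against roughly 7B for Llama2-7B is a reduction of about 71%, in line with the "about 70%" quoted above.

```python
import math


def top_k_gate(router_logits, k=2):
    """Pick the k highest-scoring experts and softmax over just those scores.

    Returns a dict {expert_index: weight}; the weights sum to 1.
    """
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)  # subtract max for stability
    exp_scores = {i: math.exp(router_logits[i] - peak) for i in top}
    total = sum(exp_scores.values())
    return {i: s / total for i, s in exp_scores.items()}


def smoe_forward(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    gate = top_k_gate(router_logits, k)
    out = [0.0] * len(x)
    for idx, weight in gate.items():
        expert_out = experts[idx](x)  # unselected experts are never called
        out = [o + weight * v for o, v in zip(out, expert_out)]
    return out


def compute_reduction(active_params, dense_params):
    """Fractional saving if compute scales with activated parameters (assumed)."""
    return 1.0 - active_params / dense_params


if __name__ == "__main__":
    # Hypothetical toy experts: each one just scales its input.
    toy_experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
    print(smoe_forward([1.0, 1.0], toy_experts, [0.1, 2.0, -1.0, 2.0]))
    print(round(compute_reduction(2e9, 7e9), 2))
```

Because only the selected experts are evaluated, per-token cost depends on k rather than on the total number of experts, which is how total parameters (8B) and activated parameters (2B) can differ.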

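JetMoE-8B is a cost-effective, transparent, and academia-friendly large language model based on a Sparsely-gated Mixture-of-Experts (SMoE) architecture. It cost less than $0.1 million to train, has 8B parameters, and uses only public datasets and training code. It delivers impressive performance while reducing inference computation by about 70%, offering a transparent, collaboration-friendly path for developing open foundation models.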

JetMoE: Reaching Llama2 Performance with 0.1M Dollars

Expert parallelism has been introduced as a strategy to distribute the
computational workload of sparsely-gated mixture-of-experts (MoE) models across
multiple computing devices, facilitating the execution of these increasingly
large-scale models. However, the All-to-All communication intrinsic to expert
parallelism constitutes a significant overhead, diminishing the MoE models'
efficiency. Current optimization approaches offer some relief, yet they are
constrained by the sequential interdependence of communication and computation
operations. To address this limitation, we present a novel shortcut-connected
MoE architecture with overlapping parallel strategy, designated as ScMoE, which
effectively decouples communication from its conventional sequence, allowing
for a substantial overlap of 70% to 100% with computation. When compared with
the prevalent top-2 MoE architecture, ScMoE demonstrates training speed
improvements of 30% and 11%, and inference improvements of 40% and 15%, in our
PCIe and NVLink hardware environments, respectively, where communication
constitutes 60% and 15% of the total MoE time consumption. Moreover,
extensive experiments and theoretical analyses indicate that ScMoE achieves
model quality comparable to, and in some instances better than, that of
existing approaches in vision and language tasks.
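
The effect of overlapping communication with computation can be illustrated with a back-of-envelope timing model. This is a sketch under stated assumptions, not the paper's analysis: the function names are hypothetical, times are normalized so that the non-overlapped MoE block takes 1.0, and communication can only be hidden behind the computation available in that same block. It covers the MoE block alone, so its output is an idealized figure and is not expected to reproduce the end-to-end training and inference improvements reported above.

```python
def moe_block_time(comm, comp, overlap):
    """Time for one MoE block when part of the communication is hidden.

    comm, comp: communication and computation time (same units).
    overlap:    fraction of communication that runs concurrently with
                computation, between 0.0 (sequential) and 1.0 (fully overlapped).
    """
    # Communication cannot be hidden behind more compute than exists.
    hidden = min(overlap * comm, comp)
    return comp + comm - hidden


def moe_block_speedup(comm_fraction, overlap):
    """Speedup of the MoE block relative to fully sequential execution."""
    comp_fraction = 1.0 - comm_fraction
    sequential = moe_block_time(comm_fraction, comp_fraction, 0.0)
    return sequential / moe_block_time(comm_fraction, comp_fraction, overlap)


if __name__ == "__main__":
    # Communication shares quoted in the abstract: 60% (PCIe), 15% (NVLink).
    for name, comm_fraction in (("PCIe", 0.60), ("NVLink", 0.15)):
        for overlap in (0.7, 1.0):
            print(name, overlap, round(moe_block_speedup(comm_fraction, overlap), 3))
```

The model shows why the benefit is larger when communication dominates: the more of the block's time is spent in All-to-All, the more there is to hide, up to the limit set by the available computation.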

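Proposes ScMoE, a novel shortcut-connected MoE architecture that uses an overlapping parallel strategy to decouple communication from its conventional sequence. Compared with the prevalent top-2 MoE architecture, it improves training speed by 30% and 11% and inference speed by 40% and 15% in PCIe and NVLink hardware environments, respectively, where communication accounts for 60% and 15% of total MoE time consumption. Extensive experiments and theoretical analyses further indicate that ScMoE matches, and in some cases surpasses, the model quality of existing approaches in vision and language tasks.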

Shortcut-connected Expert Parallelism for Accelerating Mixture-of-Experts

The field of natural language processing (NLP) has made significant strides
in recent years, particularly in the development of large-scale vision-language
models (VLMs). These models aim to bridge the gap between text and visual
information, enabling a more comprehensive understanding of multimedia data.
However, as these models become larger and more complex, they also become more
challenging to train and deploy. One approach to addressing this challenge is
the use of sparsely-gated mixture-of-experts (MoE) techniques, which divide the
model into smaller, specialized sub-models that can jointly solve a task. In
this paper, we explore the effectiveness of MoE in scaling vision-language
models, demonstrating its potential to achieve state-of-the-art performance on
a range of benchmarks over dense models of equivalent computational cost. Our
research offers valuable insights into stabilizing the training of MoE models,
understanding the impact of MoE on model interpretability, and balancing the
trade-offs between compute and performance when scaling VLMs. We hope our work will
inspire further research into the use of MoE for scaling large-scale
vision-language models and other multimodal machine learning applications.
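
The phrase "dense models of equivalent computational cost" can be made concrete by counting parameters for a single feedforward layer. The sketch below is illustrative only: the helper names, the two-matrix feedforward block without biases, and the linear router are assumptions, not details taken from the paper. It shows that an MoE layer routing each token to k experts stores many more parameters than it uses per token, so with k = 1 its per-token cost is close to that of one dense feedforward block of the same width.

```python
def ffn_params(d_model, d_ff):
    """Weights in one feedforward block: an up and a down projection (no biases)."""
    return 2 * d_model * d_ff


def moe_layer_params(d_model, d_ff, num_experts, k):
    """Return (total_params, active_params_per_token) for one MoE layer.

    Assumes every expert is a feedforward block of identical shape and the
    router is a single d_model x num_experts linear map.
    """
    per_expert = ffn_params(d_model, d_ff)
    router = d_model * num_experts
    total = num_experts * per_expert + router
    active = k * per_expert + router  # only k experts run for each token
    return total, active


if __name__ == "__main__":
    # Hypothetical sizes chosen only for illustration.
    dense = ffn_params(1024, 4096)
    total, active = moe_layer_params(1024, 4096, num_experts=8, k=1)
    print(dense, total, active)
```

Under these assumptions the MoE layer holds roughly num_experts times the capacity of the dense block while its per-token cost grows only by the small router term.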

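This study explores sparsely-gated mixture-of-experts (MoE) techniques as a way to address the challenges of training large-scale vision-language models, and their potential to achieve state-of-the-art performance at equivalent computational cost. By examining the impact of MoE on model interpretability and the trade-offs between compute and performance when scaling VLMs, the paper offers valuable insights into scaling large-scale vision-language models and hopes to inspire further research into MoE for other multimodal machine learning applications.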