The sparsely activated mixture of experts (MoE) model presents a promising
alternative to traditional densely activated (dense) models, enhancing both
quality and computational efficiency. However, training MoE models from scratch
demands extensive data and computational resources. Moreover, public
repositories like timm mainly provide pre-trained dense checkpoints and offer
few comparable resources for MoE models, which hinders their adoption. To bridge this gap,
we introduce MoE Jetpack, an effective method for fine-tuning dense checkpoints
into MoE models. MoE Jetpack incorporates two key techniques: (1) checkpoint
recycling, which repurposes dense checkpoints as initial weights for MoE
models, thereby accelerating convergence, enhancing accuracy, and alleviating
the computational burden of pre-training; (2) the hyperspherical adaptive MoE
(SpheroMoE) layer, which optimizes the MoE architecture for better integration
of dense checkpoints, enhancing fine-tuning performance. Our experiments on
vision tasks demonstrate that MoE Jetpack significantly improves convergence
speed and accuracy when fine-tuning dense checkpoints into MoE models. Our code
will be made publicly available.
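
As a rough illustration of the checkpoint recycling idea described above, the sketch below builds smaller expert feed-forward blocks by copying a subset of hidden units from a dense feed-forward block. The abstract does not say how weights are selected, so the uniform random sampling, the ReLU activation, and the names `recycle_dense_ffn` and `ffn` are all assumptions made for illustration, not the authors' procedure.

```python
import numpy as np


def recycle_dense_ffn(w1, b1, w2, b2, num_experts, expert_hidden, seed=0):
    """Initialize expert FFNs from a dense FFN by sampling hidden units.

    Shapes: w1 (d_model, d_hidden), b1 (d_hidden,),
            w2 (d_hidden, d_model), b2 (d_model,).
    Uniform sampling is an illustrative assumption, not the paper's rule.
    """
    rng = np.random.default_rng(seed)
    d_hidden = w1.shape[1]
    experts = []
    for _ in range(num_experts):
        # Pick which dense hidden units this expert inherits.
        idx = rng.choice(d_hidden, size=expert_hidden, replace=False)
        experts.append({
            "w1": w1[:, idx].copy(),  # incoming weights of the chosen units
            "b1": b1[idx].copy(),
            "w2": w2[idx, :].copy(),  # outgoing weights of the chosen units
            "b2": b2.copy(),
        })
    return experts


def ffn(x, p):
    """Two-layer feed-forward block with a ReLU (assumed activation)."""
    hidden = np.maximum(x @ p["w1"] + p["b1"], 0.0)
    return hidden @ p["w2"] + p["b2"]
```

A useful sanity check for this kind of initialization is that an expert inheriting every hidden unit reproduces the dense block's output, so each smaller expert starts as a partial copy of the pre-trained function instead of random weights.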

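We introduce MoE Jetpack, an effective method for fine-tuning dense checkpoints into MoE models. MoE Jetpack comprises two key techniques: (1) checkpoint recycling, which reuses dense checkpoints as the initial weights of MoE models to accelerate convergence, improve accuracy, and reduce the computational burden of pre-training; (2) the hyperspherical adaptive MoE (SpheroMoE) layer, which optimizes the MoE architecture to better integrate dense checkpoints and improve fine-tuning performance. Our experiments show that MoE Jetpack significantly improves convergence speed and accuracy when fine-tuning dense checkpoints into MoE models on vision tasks.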

MoE Jetpack: From Dense Checkpoints to Adaptive Mixture of Experts for Vision Tasks

The mixture of experts (MoE) model is a statistical machine learning design that
aggregates multiple expert networks using a softmax gating function to
form a more intricate and expressive model. Although MoE models are commonly used in
several applications owing to their scalability, their mathematical and
statistical properties are complex and difficult to analyze. As a
result, previous theoretical works have primarily focused on probabilistic MoE
models by imposing the impractical assumption that the data are generated from
a Gaussian MoE model. In this work, we investigate the performance of the least
squares estimators (LSE) under a deterministic MoE model where the data are
sampled according to a regression model, a setting that has remained largely
unexplored. We establish a condition called strong identifiability to
characterize the convergence behavior of various types of expert functions. We
demonstrate that the rates for estimating strongly identifiable experts, namely
the widely used feed-forward networks with activation functions
$\mathrm{sigmoid}(\cdot)$ and $\tanh(\cdot)$, are substantially faster than
those of polynomial experts, which we show to exhibit a surprisingly slow
estimation rate. Our findings have important practical implications for expert
selection.
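
To make the setting concrete, the following minimal sketch shows a softmax-gated MoE regression function and the least-squares objective that an LSE would minimize. The single-layer experts of the form `activation(a^T x + b)`, the linear gating scores, and the names `moe_predict` and `least_squares_loss` are assumptions chosen for illustration; the abstract does not fix these details.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def moe_predict(x, gate_w, gate_b, expert_a, expert_b, activation=np.tanh):
    """Softmax-gated mixture of k scalar-output experts.

    Shapes: x (n, d); gate_w, expert_a (d, k); gate_b, expert_b (k,).
    Expert j computes activation(x @ expert_a[:, j] + expert_b[j]).
    """
    gates = softmax(x @ gate_w + gate_b)            # (n, k), rows sum to 1
    experts = activation(x @ expert_a + expert_b)   # (n, k)
    return (gates * experts).sum(axis=1)            # (n,)


def least_squares_loss(params, x, y):
    """Mean squared residual; it has the same minimizer as the sum of squares."""
    return np.mean((y - moe_predict(x, *params)) ** 2)
```

Passing a different `activation`, for example a polynomial such as `lambda t: t ** 2`, swaps in the kind of polynomial expert that the abstract contrasts with sigmoid and tanh experts.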

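In this work, we investigate the performance of least squares estimators (LSE) under a deterministic MoE model in which the data are sampled according to a regression model, and we establish a condition called strong identifiability to characterize the convergence behavior of different types of expert functions. We show that the estimation rates for the widely used feed-forward network experts with sigmoid and tanh activation functions are substantially faster than those of polynomial experts, which exhibit a surprisingly slow estimation rate. Our findings have important practical implications for expert selection.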