Sparse activation, which selectively activates only an input-dependent set of
neurons in inference, is a useful technique to reduce the computing cost of
Large Language Models (LLMs) without retraining or adaptation efforts. However,
whether it can be applied to the recently emerging Small Language Models (SLMs)
remains questionable, because SLMs are generally less over-parameterized than
LLMs. In this paper, we aim to achieve sparse activation in SLMs. We first show
that existing sparse activation schemes for LLMs, which build on neurons'
output magnitudes, cannot be applied to SLMs, and that activating neurons based
on their attribution scores is a better alternative. We then demonstrate and
quantify the large errors that existing attribution metrics incur when used for
sparse activation, which stem from the interdependency among attribution scores
of neurons across different layers. Based on these observations, we propose a
new attribution metric that can provably correct such errors and achieve precise
sparse activation. Experiments over multiple popular SLMs and datasets show
that our approach can achieve an 80% sparsification ratio with <5% model accuracy
loss, comparable to the sparse activation achieved in LLMs. The source code is
available at: this https URL
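
The contrast between magnitude-based and attribution-based neuron selection can be illustrated with a minimal sketch. It assumes a generic first-order attribution score (gradient times output), which is not the corrected metric proposed in the paper; the helper names, the toy values, and the 60% sparsification ratio are all hypothetical.

```python
def top_fraction_mask(scores, keep_ratio):
    """Return a 0/1 mask that keeps the highest-scoring neurons."""
    n_keep = max(1, round(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:n_keep])
    return [1 if i in keep else 0 for i in range(len(scores))]


def magnitude_scores(outputs):
    """Score each neuron by the absolute value of its output."""
    return [abs(a) for a in outputs]


def grad_x_output_scores(outputs, grads):
    """Generic first-order attribution (assumption): |output * gradient|."""
    return [abs(a * g) for a, g in zip(outputs, grads)]


# Toy layer with five neurons (illustrative values only).
outputs = [0.9, -0.1, 0.5, 0.05, -0.7]  # neuron outputs
grads = [0.01, 2.0, 0.1, 3.0, 0.02]     # d(loss)/d(output) per neuron
keep_ratio = 0.4                        # keep 2 of 5 neurons (60% sparsification)

mag_mask = top_fraction_mask(magnitude_scores(outputs), keep_ratio)
attr_mask = top_fraction_mask(grad_x_output_scores(outputs, grads), keep_ratio)

print(mag_mask)   # neurons kept by output magnitude
print(attr_mask)  # neurons kept by attribution score
```

In this toy case the two criteria keep disjoint sets of neurons: the neurons with the largest outputs have near-zero gradients, so they contribute little to the loss under the attribution score.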

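We achieve sparse activation in Small Language Models (SLMs), using a new attribution metric to make the sparse activation precise. Experiments show that our approach reaches an 80% sparsification ratio with less than 5% loss in model accuracy, comparable to the sparse activation achieved in Large Language Models (LLMs).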

Achieving Sparse Activation in Small Language Models

We present SkillNet-NLG, a sparsely activated approach that handles many
natural language generation tasks with one model. Unlike traditional
dense models that always activate all the parameters, SkillNet-NLG selectively
activates relevant parts of the parameters to accomplish a task, where the
relevance is controlled by a set of predefined skills. The strength of this
design is that it provides an opportunity to precisely adapt relevant
skills to learn new tasks effectively. We evaluate on Chinese natural language
generation tasks. Results show that, with only one model file, SkillNet-NLG
outperforms the previous best-performing methods on four of five tasks.
SkillNet-NLG performs better than two multi-task learning baselines (a dense
model and a Mixture-of-Expert model) and achieves comparable performance to
task-specific models. Lastly, SkillNet-NLG surpasses baseline systems when
adapted to new tasks.
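
The idea of activating only the parameters tied to task-relevant skills can be sketched as follows. This is a toy illustration under stated assumptions, not the SkillNet-NLG architecture: the skill names, the task-to-skill mapping, the stand-in modules, and the averaging of module outputs are all hypothetical.

```python
# Hypothetical skill modules: each stands in for a block of parameters.
SKILL_MODULES = {
    "open_ended": lambda x: [v * 2.0 for v in x],
    "data_to_text": lambda x: [v + 1.0 for v in x],
    "general": lambda x: list(x),
}

# Hypothetical mapping from a task to the predefined skills it needs.
TASK_SKILLS = {
    "story_generation": ["open_ended", "general"],
    "table_description": ["data_to_text", "general"],
}


def sparse_forward(x, task):
    """Run only the modules whose skills are relevant to the task.

    The outputs of the active modules are averaged (an assumption);
    modules for irrelevant skills are never evaluated.
    """
    active = TASK_SKILLS[task]
    outs = [SKILL_MODULES[name](x) for name in active]
    combined = [sum(col) / len(outs) for col in zip(*outs)]
    return combined, active


y, used = sparse_forward([1.0, 2.0], "story_generation")
print(y, used)
```

A dense model would evaluate every module for every input; here the `data_to_text` module is skipped entirely for the story generation task.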

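We introduce SkillNet-NLG, a sparsely activated approach that handles many natural language generation tasks with one model. Unlike traditional dense models, SkillNet-NLG activates only the parameters relevant to a task, as controlled by a set of predefined skills. Experiments show that SkillNet-NLG outperforms the previous best methods on four of five tasks, performs better than two multi-task learning baselines, is comparable to task-specific models, and surpasses baseline systems when adapted to new tasks.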

SkillNet-NLG: General-Purpose Natural Language Generation with a Sparsely Activated Approach

Mixture-of-Experts (MoE) models can achieve promising results with an
outrageously large number of parameters but constant computation cost, and thus
they have become a trend in model scaling. Still, it is a mystery how MoE layers bring
quality gains by leveraging the parameters with sparse activation. In this
work, we investigate several key factors in sparse expert models. We observe
that load imbalance may not be a significant problem affecting model quality,
contrary to the perspectives of recent studies, while the number of sparsely
activated experts $k$ and expert capacity $C$ in top-$k$ routing can
make a significant difference in this context. Furthermore, we take a step
forward to propose a simple method called expert prototyping that splits
experts into different prototypes and applies $k$ top-$1$ routing. This
strategy improves model quality while maintaining constant computational cost,
and our further exploration of extremely large-scale models suggests that it is
more effective in training larger models. We push the model scale to over $1$
trillion parameters and implement it on only $480$ NVIDIA V100-32GB GPUs, in
comparison with the recent SOTAs on $2048$ TPU cores. The proposed giant model
achieves substantial speedup in convergence over the same-size baseline.
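
The difference between top-$k$ routing and $k$ top-$1$ routing over expert prototypes can be sketched in a few lines. This is a simplified illustration, not the paper's implementation: it assumes the experts are split into $k$ equal contiguous groups and that a single vector of gate logits is sliced per group, whereas a real system would typically use a separate router per prototype. The function names and logit values are hypothetical.

```python
def top_k_routing(logits, k):
    """Pick the k experts with the highest gate logits overall."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return sorted(ranked[:k])


def k_top1_routing(logits, k):
    """Split experts into k prototypes and pick the top-1 expert in each.

    Assumes the number of experts is divisible by k and that prototype p
    owns the contiguous index range [p * size, (p + 1) * size).
    """
    n = len(logits)
    if n % k != 0:
        raise ValueError("number of experts must be divisible by k")
    size = n // k
    chosen = []
    for p in range(k):
        group = range(p * size, (p + 1) * size)
        chosen.append(max(group, key=lambda i: logits[i]))
    return chosen


# Gate logits for one token over eight experts (illustrative values).
logits = [0.1, 2.0, 1.5, 0.3, 0.2, 0.9, 0.4, 0.8]

top2 = top_k_routing(logits, 2)    # both experts may come from the same region
proto = k_top1_routing(logits, 2)  # exactly one expert from each prototype

print(top2, proto)
```

Both strategies activate the same number of experts per token, so the per-token compute stays constant; they differ only in which experts are eligible to be chosen together.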

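This paper investigates key factors in sparse expert models and proposes expert prototyping to improve model quality. It also scales the model to over 1 trillion parameters on 480 NVIDIA V100-32GB GPUs, achieving a substantial speedup in convergence over a same-size baseline.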