Human priors play a crucial role in efficiently utilizing data in deep
learning. However, with the development of large language models (LLMs), there
is an increasing emphasis on scaling both model size and data volume, which
often diminishes the importance of human priors in data construction.
Influenced by these trends, existing Small Language Models (SLMs) mainly rely
on web-scraped large-scale training data, neglecting the proper incorporation
of human priors. This oversight limits the training efficiency of language
models in resource-constrained settings. In this paper, we propose a principle
to leverage human priors for data construction. This principle emphasizes
achieving high-performance SLMs by training on a concise dataset that
accommodates both semantic diversity and data quality consistency, while
avoiding benchmark data leakage. Following this principle, we train an SLM
named HARE-1.1B. Extensive experiments on large-scale benchmark datasets
demonstrate that HARE-1.1B performs favorably against state-of-the-art SLMs,
validating the effectiveness of the proposed principle. Additionally, this
work provides new insights into efficient language model training in
resource-constrained environments from the perspective of human priors.
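
As an illustration of what "avoiding benchmark data leakage" can look like in practice, the following is a minimal sketch of n-gram overlap filtering. The abstract does not say how leakage is detected, so the 5-gram window, the whitespace tokenization, and all function names here are assumptions made for illustration, not the authors' pipeline.

```python
# Hypothetical n-gram decontamination sketch (not the paper's method).

def ngrams(text, n=5):
    """Return the set of lowercase word n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_benchmark_index(benchmark_texts, n=5):
    """Collect every n-gram that appears in any benchmark item."""
    index = set()
    for text in benchmark_texts:
        index |= ngrams(text, n)
    return index

def is_contaminated(sample, index, n=5):
    """A sample is flagged if it shares at least one n-gram with the benchmark."""
    return not ngrams(sample, n).isdisjoint(index)

def decontaminate(corpus, benchmark_texts, n=5):
    """Keep only training samples with no n-gram overlap with the benchmark."""
    index = build_benchmark_index(benchmark_texts, n)
    return [s for s in corpus if not is_contaminated(s, index, n)]
```

A training sample that quotes a benchmark question word for word would be dropped, while unrelated text is kept. Real pipelines often use a proper tokenizer and a longer window; the choice of `n` trades false positives against missed paraphrases.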

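This work studies efficient language model training in resource-constrained environments, proposes a principle for leveraging human priors in data construction, and validates the principle by training the HARE-1.1B model on a concise dataset.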

HARE: HumAn pRiors, a key to small language model Efficiency

Sparse activation, which selectively activates only an input-dependent set of
neurons during inference, is a useful technique to reduce the computing cost of
Large Language Models (LLMs) without retraining or adaptation efforts. However,
whether it can be applied to the recently emerging Small Language Models (SLMs)
remains questionable, because SLMs are generally less over-parameterized than
LLMs. In this paper, we aim to achieve sparse activation in SLMs. We first show
that the existing sparse activation schemes in LLMs that build on neurons'
output magnitudes cannot be applied to SLMs, and activating neurons based on
their attribution scores is a better alternative. Further, we demonstrate and
quantify the large errors of existing attribution metrics when used for
sparse activation, which arise from the interdependency among attribution scores of
neurons across different layers. Based on these observations, we propose a new
attribution metric that can provably correct such errors and achieve precise
sparse activation. Experiments over multiple popular SLMs and datasets show
that our approach can achieve an 80% sparsification ratio with <5% model accuracy
loss, comparable to the sparse activation achieved in LLMs. The source code is
available at: this https URL
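
To make the contrast between magnitude-based and attribution-based neuron selection concrete, here is a toy sketch for a single hidden layer feeding a scalar output. It uses the standard gradient-times-activation score as the attribution, which is one of the existing metrics the abstract refers to, not the corrected metric the paper proposes. The numeric values and all names are hypothetical.

```python
# Toy comparison: keep the top-k neurons by output magnitude vs. by attribution.
# For a linear readout y = sum_i w_i * h_i, the gradient dy/dh_i equals w_i,
# so gradient-times-activation attribution for neuron i is w_i * h_i.

def top_k_mask(scores, k):
    """Return a 0/1 mask keeping the k entries with the largest absolute score."""
    order = sorted(range(len(scores)), key=lambda i: abs(scores[i]), reverse=True)
    keep = set(order[:k])
    return [1.0 if i in keep else 0.0 for i in range(len(scores))]

def forward(hidden, out_weights, mask=None):
    """Scalar output of the readout layer, optionally with neurons masked out."""
    if mask is None:
        mask = [1.0] * len(hidden)
    return sum(h * w * m for h, w, m in zip(hidden, out_weights, mask))

# Hypothetical activations and readout weights (illustrative values only).
hidden = [5.0, 0.5, 3.0, 0.1]
out_weights = [0.01, 2.0, 0.02, -4.0]

magnitude_scores = hidden
attribution_scores = [h * w for h, w in zip(hidden, out_weights)]

k = 2  # keep half the neurons, i.e. a 50% sparsification ratio
dense_out = forward(hidden, out_weights)
magnitude_out = forward(hidden, out_weights, top_k_mask(magnitude_scores, k))
attribution_out = forward(hidden, out_weights, top_k_mask(attribution_scores, k))

magnitude_error = abs(dense_out - magnitude_out)
attribution_error = abs(dense_out - attribution_out)
```

In this toy setting the neurons with the largest activations contribute little to the output, so masking by magnitude discards the neurons that matter most, and `attribution_error` comes out smaller than `magnitude_error`. The sketch covers one layer only; it does not model the cross-layer interdependency of attribution scores that the abstract identifies as the source of error in existing metrics.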

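We achieve sparse activation in Small Language Models (SLMs) using a new attribution metric that enables precise sparse activation. Experiments show that our approach reaches an 80% sparsification ratio with <5% loss in model accuracy, comparable to the sparse activation achieved in Large Language Models (LLMs).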