Previous language model pre-training methods have uniformly applied a
next-token prediction loss to all training tokens. Challenging this norm, we
posit that "Not all tokens in a corpus are equally important for language model
training". Our initial analysis delves into the token-level training dynamics of
language models, revealing distinct loss patterns for different tokens.
Leveraging these insights, we introduce a new language model called Rho-1.
Unlike traditional LMs that learn to predict every next token in a corpus,
Rho-1 employs Selective Language Modeling (SLM), which selectively trains on
useful tokens that align with the desired distribution. This approach
involves scoring pretraining tokens using a reference model, and then training
the language model with a focused loss on tokens with higher excess loss. When
continually pretrained on the 15B-token OpenWebMath corpus, Rho-1 yields an
absolute improvement in few-shot accuracy of up to 30% on 9 math tasks. After
fine-tuning, Rho-1-1B and 7B achieved state-of-the-art results of 40.6% and
51.8% on the MATH dataset, respectively, matching DeepSeekMath with only 3% of
the pretraining tokens. Furthermore, when pretrained on 80B general tokens,
Rho-1 achieves an average improvement of 6.8% across 15 diverse tasks,
increasing both the efficiency and performance of language model pre-training.
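
The selection step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that "excess loss" means the training model's per-token loss minus the reference model's per-token loss, and that a fixed top fraction of tokens is kept. The abstract states neither detail, and the names `select_tokens`, `selective_loss`, and `keep_ratio` are hypothetical.

```python
from typing import List


def select_tokens(train_losses: List[float],
                  ref_losses: List[float],
                  keep_ratio: float) -> List[int]:
    """Return a 0/1 mask that keeps the tokens with the highest excess loss.

    Assumption: excess loss = training-model loss - reference-model loss.
    Assumption: the top `keep_ratio` fraction of tokens is kept.
    """
    if len(train_losses) != len(ref_losses):
        raise ValueError("loss lists must have the same length")
    if not train_losses:
        return []
    excess = [t - r for t, r in zip(train_losses, ref_losses)]
    k = max(1, round(keep_ratio * len(excess)))
    # Indices of the k tokens with the largest excess loss.
    top = sorted(range(len(excess)), key=lambda i: excess[i], reverse=True)[:k]
    mask = [0] * len(excess)
    for i in top:
        mask[i] = 1
    return mask


def selective_loss(train_losses: List[float], mask: List[int]) -> float:
    """Average the training loss over the selected tokens only."""
    kept = [loss for loss, m in zip(train_losses, mask) if m]
    return sum(kept) / len(kept)


# Toy per-token losses for a four-token sequence (made-up numbers).
train = [2.0, 0.5, 3.0, 1.0]
ref = [1.9, 0.4, 1.0, 0.2]
mask = select_tokens(train, ref, keep_ratio=0.5)
loss = selective_loss(train, mask)
```

In this toy example the last two tokens have the largest gap between the two models, so only they contribute to the averaged loss. The first two tokens, which the training model already fits about as well as the reference model, are masked out.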

Rho-1: Not All Tokens Are What You Need

We propose an effective prompting approach that integrates self-evaluation
guidance through stochastic beam search. Our approach explores the reasoning
search space using a well-calibrated automatic criterion. This enables an
efficient search to produce higher-quality final predictions. With the
self-evaluation guided stochastic beam search, we also balance the
quality--diversity trade-off in the generation of reasoning chains. This allows
our approach to adapt well to majority voting and surpass the corresponding
Codex-backboned baselines by $6.34\%$, $9.56\%$, and $5.46\%$ on the GSM8K,
AQUA, and StrategyQA benchmarks, respectively, in few-shot accuracy. Analysis
of our decompositional reasoning finds that it pinpoints logic failures and leads to
higher consistency and robustness.
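
A rough sketch of how self-evaluation might guide a stochastic beam search over reasoning steps follows. It is not the paper's implementation. The abstract does not give the scoring rule, so the weighted mix of step log-probability and log self-evaluation confidence is an assumption, as are the callables `propose`, `step_logprob`, and `self_eval` and the parameters `weight` and `temperature`.

```python
import math
import random
from typing import Callable, List, Tuple

Chain = Tuple[str, ...]


def sample_without_replacement(scores: List[float], k: int,
                               temperature: float,
                               rng: random.Random) -> List[int]:
    """Draw up to k distinct indices, each with probability proportional to
    exp(score / temperature) among the indices not yet chosen."""
    remaining = list(range(len(scores)))
    chosen: List[int] = []
    while remaining and len(chosen) < k:
        top = max(scores[i] for i in remaining)  # shift for numerical stability
        weights = [math.exp((scores[i] - top) / temperature) for i in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return chosen


def guided_beam_search(propose: Callable[[Chain], List[str]],
                       step_logprob: Callable[[Chain, str], float],
                       self_eval: Callable[[Chain, str], float],
                       beam_size: int, max_steps: int,
                       weight: float = 0.5, temperature: float = 1.0,
                       seed: int = 0) -> List[Tuple[Chain, float]]:
    """Return the surviving reasoning chains with scores, best first.

    Assumption: a step's score is
        weight * log p(step) + (1 - weight) * log(self-evaluation confidence).
    """
    rng = random.Random(seed)
    beams: List[Tuple[Chain, float]] = [((), 0.0)]
    for _ in range(max_steps):
        candidates: List[Tuple[Chain, float]] = []
        for chain, score in beams:
            for step in propose(chain):
                # Clamp the confidence into (0, 1] so the log is defined.
                conf = min(max(self_eval(chain, step), 1e-9), 1.0)
                gain = (weight * step_logprob(chain, step)
                        + (1.0 - weight) * math.log(conf))
                candidates.append((chain + (step,), score + gain))
        if not candidates:
            break
        picks = sample_without_replacement(
            [s for _, s in candidates], beam_size, temperature, rng)
        beams = [candidates[i] for i in picks]
    return sorted(beams, key=lambda b: b[1], reverse=True)
```

Sampling the beam, rather than always keeping the top-scoring candidates, is what lets `temperature` trade quality against diversity. A low value approaches ordinary beam search, while a higher value keeps more varied chains, which can then be aggregated by majority voting.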

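This work proposes an effective prompting approach that integrates self-evaluation guidance through stochastic beam search. It balances the quality-diversity trade-off in generated reasoning chains and, in the few-shot setting, surpasses the corresponding Codex-backboned baselines by 6.34%, 9.56%, and 5.46% in accuracy on the GSM8K, AQUA, and StrategyQA benchmarks, respectively. Its fine-grained reasoning also pinpoints and resolves logic failures, improving consistency and robustness.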