Training on model-generated synthetic data is a promising approach for
finetuning LLMs, but it remains unclear when it helps or hurts. In this paper,
we investigate this question for math reasoning via an empirical study,
followed by building a conceptual understanding of our observations. First, we
find that while the typical approach of finetuning a model on synthetic correct
or positive problem-solution pairs generated by capable models offers modest
performance gains, sampling more correct solutions from the finetuned learner
itself and then finetuning on this self-generated data
$\textbf{doubles}$ the efficiency of the same synthetic problems. At the same
time, training on model-generated positives can amplify various spurious
correlations, resulting in flat or even inverse scaling trends as the amount of
data increases. Surprisingly, we find that several of these issues can be
addressed if we also utilize negative responses, i.e., model-generated
responses that are deemed incorrect by a final answer verifier. Crucially,
these negatives must be constructed such that the training can appropriately
recover the utility or advantage of each intermediate step in the negative
response. With this per-step scheme, we attain consistent gains
over training on positive data alone, reaching performance comparable to scaling
up the amount of synthetic data by $\mathbf{8 \times}$. We show that training on per-step
negatives can help to unlearn spurious correlations in the positive data, and
is equivalent to advantage-weighted reinforcement learning (RL), implying that
it inherits robustness benefits of RL over imitating positive data alone.
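
To make the per-step idea concrete, here is a minimal sketch of one way a step-level advantage could be estimated. It assumes (this is not stated in the abstract) that the value of a partial solution is approximated by the fraction of sampled completions from that prefix that pass the final answer verifier, and that a step's advantage is the change in that value. The function names `estimate_value` and `step_advantages` are hypothetical.

```python
from typing import List


def estimate_value(rollout_outcomes: List[bool]) -> float:
    """Fraction of sampled completions that the final answer verifier accepted.

    Assumption: each entry is the verifier's verdict for one completion
    sampled from the same partial solution (prefix).
    """
    if not rollout_outcomes:
        raise ValueError("need at least one rollout to estimate a value")
    return sum(rollout_outcomes) / len(rollout_outcomes)


def step_advantages(prefix_values: List[float]) -> List[float]:
    """Advantage of each step as the change in estimated success rate.

    prefix_values[0] is the estimate before any step is taken;
    prefix_values[i] is the estimate after step i.
    """
    return [after - before for before, after in zip(prefix_values, prefix_values[1:])]


# Hypothetical verifier verdicts for rollouts from three prefixes of one
# incorrect (negative) response: before step 1, after step 1, after step 2.
outcomes_per_prefix = [
    [True, False, True, False],
    [True, True, True, False],
    [False, False, False, False],
]
values = [estimate_value(o) for o in outcomes_per_prefix]
advantages = step_advantages(values)
```

Under these assumptions the first step of the negative response has positive advantage and only the second step is penalized, which illustrates why a per-step scheme can avoid discouraging useful intermediate steps that happen to appear in an incorrect response.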

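Through an empirical study, we find that training on model-generated synthetic data can improve math reasoning performance, and that adding negative responses further strengthens these gains and removes spurious correlations from the data.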

RL on Incorrect Synthetic Data Scales the Efficiency of LLM Math Reasoning by Eight-Fold

Reinforcement learning from human feedback (RLHF) has been a central
technique for recent large language model (LLM) alignment. However, its heavy
dependence on costly human or LLM-as-Judge preference feedback could stymie its
wider application. In this work, we introduce Self-Contrast, a feedback-free
LLM alignment method that exploits extensive self-generated
negatives. With only supervised fine-tuning (SFT) targets, Self-Contrast
uses the LLM itself to generate a large pool of diverse candidates, and uses
a pre-trained embedding model to filter multiple negatives according to text
similarity. Theoretically, we illustrate that in this setting, merely scaling
negative responses can still effectively approximate situations with more
balanced positive and negative preference annotations. Our experiments with
direct preference optimization (DPO) on three datasets show that Self-Contrast
consistently outperforms SFT and standard DPO training by large margins.
Moreover, as the number of self-generated negatives increases, the performance of
Self-Contrast continues to grow. Code and data are available at
this https URL
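
The filtering step described above can be sketched as follows. This is an illustrative sketch only: it assumes the embeddings have already been computed by some embedding model, and it assumes that the candidates least similar to the SFT target (by cosine similarity) are the ones kept as negatives. The abstract does not state the exact selection rule, and the names `cosine_similarity` and `select_negatives` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def select_negatives(
    target_embedding: Sequence[float],
    candidate_embeddings: List[Sequence[float]],
    num_negatives: int,
) -> List[int]:
    """Return indices of the candidates least similar to the SFT target.

    Assumption: low similarity to the target marks a candidate as a
    usable negative; the real method may use a different rule.
    """
    sims = [cosine_similarity(target_embedding, c) for c in candidate_embeddings]
    ranked = sorted(range(len(sims)), key=lambda i: sims[i])
    return ranked[:num_negatives]


# Toy 2-D embeddings standing in for the output of a real embedding model.
target = [1.0, 0.0]
candidates = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [1.0, 1.0]]
negative_indices = select_negatives(target, candidates, num_negatives=2)
```

Each selected candidate could then be paired with the SFT target as a (chosen, rejected) example for DPO, so increasing `num_negatives` yields more preference pairs per prompt without any additional feedback.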

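Self-Contrast is an LLM alignment method that requires no human feedback and instead exploits self-generated negatives. Using only supervised fine-tuning (SFT) targets, it has the language model itself generate a large number of diverse candidates and uses a pre-trained embedding model to filter multiple negatives by text similarity. It is shown that, in this setting, merely scaling the number of negative responses can still effectively approximate the case of more balanced positive and negative preference annotations. Direct preference optimization (DPO) experiments on three datasets show that Self-Contrast consistently and significantly outperforms SFT and standard DPO training, and that its performance keeps improving as the number of self-generated negatives increases.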