In this paper, we propose R$^3$: Learning Reasoning through Reverse
Curriculum Reinforcement Learning (RL), a novel method that employs only
outcome supervision to achieve the benefits of process supervision for large
language models. The core challenge in applying RL to complex reasoning is to
identify a sequence of actions that results in positive rewards and provide
appropriate supervision for optimization. Outcome supervision provides sparse
rewards for final results without identifying error locations, whereas process
supervision offers step-wise rewards but requires extensive manual annotation.
R$^3$ overcomes these limitations by learning from correct demonstrations.
Specifically, R$^3$ progressively slides the start state of reasoning from a
demonstration's end to its beginning, facilitating easier model exploration at
all stages. Thus, R$^3$ establishes a step-wise curriculum, allowing outcome
supervision to offer step-level signals and precisely pinpoint errors. Using
Llama2-7B, our method surpasses the RL baseline on eight reasoning tasks by $4.1$
points on average. Notably, in program-based reasoning on GSM8K, it exceeds
the baseline by $4.2$ points across three backbone models, and without any
extra data, Codellama-7B + R$^3$ performs comparably to larger models or
closed-source models.
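
The sliding start state described above can be illustrated with a minimal sketch. It assumes a demonstration is already split into a list of reasoning steps and that the final answer is checked by exact string match; the function names `reverse_curriculum_prompts` and `outcome_reward` are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def reverse_curriculum_prompts(
    question: str, demo_steps: List[str]
) -> List[Tuple[int, str]]:
    """Build start states ordered from easiest to hardest.

    Each entry is (steps_left_to_generate, prompt). The first stage gives
    the model every demonstration step except the last one; the final
    stage gives it only the question.
    """
    stages = []
    n = len(demo_steps)
    # Slide the start state from the demonstration's end to its beginning.
    for given in range(n - 1, -1, -1):
        prompt = "\n".join([question] + demo_steps[:given])
        stages.append((n - given, prompt))
    return stages


def outcome_reward(predicted_answer: str, gold_answer: str) -> float:
    """Sparse outcome reward (assumed here to be exact match on the answer)."""
    return 1.0 if predicted_answer.strip() == gold_answer.strip() else 0.0


if __name__ == "__main__":
    demo = ["step 1", "step 2", "step 3"]
    for steps_left, prompt in reverse_curriculum_prompts("Question", demo):
        print(steps_left, repr(prompt))
```

In the earliest stage the model has to produce only one step, so a zero outcome reward points at that step; later stages extend the same signal toward the beginning of the reasoning chain.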

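This work proposes R$^3$, a reverse curriculum reinforcement learning method for learning to reason, which uses only outcome supervision to obtain the benefits of process supervision for large language models. Learning from correct demonstrations, the method progressively slides the start state of reasoning to build a step-wise curriculum, which makes exploration easier at every stage and lets outcome supervision provide step-level signals and pinpoint errors precisely. With Llama2-7B, the method outperforms the RL baseline by 4.1 points on average across eight reasoning tasks. Notably, in program-based reasoning on GSM8K, it exceeds the baseline by 4.2 points across three backbone models, and without any extra data, Codellama-7B + R$^3$ performs comparably to larger or closed-source models.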

Training Large Language Models for Reasoning through Reverse Curriculum Reinforcement Learning

In recent years, large language models have greatly improved in their ability
to perform complex multi-step reasoning. However, even state-of-the-art models
still regularly produce logical mistakes. To train more reliable models, we can
turn either to outcome supervision, which provides feedback for a final result,
or process supervision, which provides feedback for each intermediate reasoning
step. Given the importance of training reliable models, and given the high cost
of human feedback, it is important to carefully compare both methods.
Recent work has already begun this comparison, but many questions still remain.
We conduct our own investigation, finding that process supervision
significantly outperforms outcome supervision for training models to solve
problems from the challenging MATH dataset. Our process-supervised model solves
78% of problems from a representative subset of the MATH test set.
Additionally, we show that active learning significantly improves the efficacy
of process supervision. To support related research, we also release PRM800K,
the complete dataset of 800,000 step-level human feedback labels used to train
our best reward model.
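
The difference between the two kinds of feedback can be illustrated with a small sketch. It assumes per-step correctness labels are available as booleans; the helper names are hypothetical, and the code does not reproduce the paper's reward model or the PRM800K format.

```python
from typing import List, Optional


def outcome_signal(final_correct: bool, num_steps: int) -> List[Optional[float]]:
    """Outcome supervision: feedback is attached to the final result only."""
    return [None] * (num_steps - 1) + [1.0 if final_correct else 0.0]


def process_signal(step_labels: List[bool]) -> List[float]:
    """Process supervision: every intermediate step receives feedback."""
    return [1.0 if ok else 0.0 for ok in step_labels]


def locate_first_error(signal: List[Optional[float]]) -> Optional[int]:
    """Return the index of the first step marked wrong, if it can be located.

    If an unlabeled step comes first, the error position is unknown.
    """
    for i, value in enumerate(signal):
        if value is None:
            return None
        if value == 0.0:
            return i
    return None


if __name__ == "__main__":
    labels = [True, False, False]  # the solution goes wrong at step index 1
    print(locate_first_error(process_signal(labels)))
    print(locate_first_error(outcome_signal(False, len(labels))))
```

With step-level labels the first wrong step can be read off directly, whereas the outcome signal only says that the final result is wrong.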

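This paper studies how the choice of supervision method affects language model training. It finds that process supervision significantly improves accuracy on complex math problems and that active learning further strengthens process supervision. The paper also releases a complete dataset and recommends adopting process supervision in related language model research.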