Large language models (LLMs) often struggle to maintain accuracy across the
sequence of intermediate steps in multi-step mathematical reasoning, leading
to error propagation that undermines the final result. Current methods mitigate
this issue primarily by using a verifier model to assess the correctness of
generated solution candidates, evaluating either the complete reasoning path or
an incomplete one. Rethinking this approach, we argue that assessing the
potential of incomplete reasoning paths can be more advantageous because it
guides generation towards correct final answers, transforming the task into a
\textit{planning} problem. Our proposed verifier, the
Outcome-supervision Value Model (OVM), employs outcome supervision for
training, offering an efficient and intuitive method for \textit{planning} by
prioritizing steps that lead to accurate conclusions over mere per-step
correctness. Furthermore, OVM requires no labor-intensive annotations of
step-level correctness, which improves its scalability. Our
experiments on two multi-step mathematical reasoning datasets, GSM8K and Game
of 24, demonstrate the superior performance of OVM. Notably, on GSM8K, our
\textbf{OVM-7B model achieves state-of-the-art results among LLMs up to 13B
parameters}, without using GPT-4 or code execution.
These findings offer a novel perspective on the role of outcome supervision in
training verifiers for multi-step reasoning tasks and provide theoretical
justification for its advantage in value estimation for planning.

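Trained with outcome supervision, the Outcome-supervision Value Model (OVM) prioritizes steps that lead to accurate conclusions over the correctness of each individual step, thereby turning multi-step reasoning into a planning problem and offering an efficient and intuitive solution. Experiments on two multi-step mathematical reasoning datasets, GSM8K and Game of 24, show that OVM achieves superior performance; in particular, on GSM8K, the OVM-7B model achieves state-of-the-art results among LLMs up to 13B parameters. These findings offer a new perspective on the role of outcome supervision in training verifiers for multi-step reasoning tasks and provide theoretical justification for its advantage in value estimation for planning.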
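
As a rough illustration of what planning with a value model over incomplete reasoning paths could look like, the following is a minimal sketch of value-guided beam search. It is not the authors' implementation: the function `value_guided_beam_search`, its parameters, and the toy task (pick three numbers from {1, 3, 4} that sum to 10) are all hypothetical, and the hand-written `toy_value` oracle merely stands in for a trained verifier that would score each partial path.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple

Path = Tuple[int, ...]  # a (possibly incomplete) sequence of steps


def value_guided_beam_search(
    propose_steps: Callable[[Path], Sequence[int]],
    value_fn: Callable[[Path], float],
    is_complete: Callable[[Path], bool],
    beam_size: int = 2,
    max_steps: int = 10,
) -> Path:
    """Keep the `beam_size` partial paths with the highest estimated value."""
    beam: List[Path] = [()]
    for _ in range(max_steps):
        candidates: List[Path] = []
        for path in beam:
            if is_complete(path):
                candidates.append(path)  # carry finished paths forward unchanged
                continue
            for step in propose_steps(path):
                candidates.append(path + (step,))
        if not candidates:
            break
        # Rank by the value of the partial path, not by per-step correctness.
        candidates.sort(key=value_fn, reverse=True)
        beam = candidates[:beam_size]
        if all(is_complete(p) for p in beam):
            break
    return max(beam, key=value_fn)


# --- Toy task (hypothetical): choose 3 numbers from STEPS that sum to TARGET.
STEPS = (1, 3, 4)
TARGET = 10
LENGTH = 3


def toy_value(path: Path) -> float:
    """Stand-in for a learned value model: 1.0 if the target is still reachable."""
    remaining = LENGTH - len(path)
    gap = TARGET - sum(path)
    reachable = {sum(combo) for combo in product(STEPS, repeat=remaining)}
    return 1.0 if gap in reachable else 0.0


best = value_guided_beam_search(
    propose_steps=lambda path: STEPS,
    value_fn=toy_value,
    is_complete=lambda path: len(path) == LENGTH,
    beam_size=2,
)
print(best, sum(best))
```

In this sketch the search discards a first step of 1 immediately, because no completion of that path can reach the target. That is the sense in which a value estimate on incomplete paths steers generation towards a correct final answer. A real system would replace `toy_value` with a model's predicted score and `propose_steps` with sampled next steps from a generator.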