We introduce Direct Value Optimization (DVO), a reinforcement learning framework for improving large language models on complex reasoning tasks. Unlike traditional methods that rely on preference labels, DVO uses value signals at individual reasoning steps and optimizes the model with a mean squared error loss. Its key benefit is fine-grained supervision that does not require labor-intensive human annotation. Target values in DVO are estimated using either Monte Carlo Tree Search or an outcome value model. Our empirical analysis on both mathematical and commonsense reasoning tasks shows that DVO consistently outperforms existing offline preference optimization techniques, even with fewer training steps. These findings underscore the importance of value signals in advancing reasoning capabilities and highlight DVO as a superior method in settings that lack explicit human preference information.
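
To make the step-level objective concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the target value for a reasoning step is approximated by the mean final reward of plain Monte Carlo rollouts continued from that step (a simpler stand-in for Monte Carlo Tree Search or an outcome value model), and that the model's own per-step values are already available as numbers. The abstract does not say how those values are derived. The function names `rollout_value_estimate` and `step_value_mse` are hypothetical.

```python
from statistics import fmean


def rollout_value_estimate(outcomes):
    """Estimate a step's target value as the mean final reward of rollouts.

    Assumption: each entry is the reward (e.g. 1.0 correct, 0.0 incorrect)
    of one completion sampled from this reasoning step onward.
    """
    if not outcomes:
        raise ValueError("need at least one rollout outcome")
    return fmean(outcomes)


def step_value_mse(predicted, targets):
    """Mean squared error between per-step predicted and target values."""
    if len(predicted) != len(targets):
        raise ValueError("predicted and targets must have the same length")
    if not predicted:
        raise ValueError("need at least one reasoning step")
    return fmean([(p - t) ** 2 for p, t in zip(predicted, targets)])


# Hypothetical rollout rewards for three reasoning steps of one solution.
rollouts_per_step = [
    [1.0, 1.0, 0.0, 1.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
targets = [rollout_value_estimate(r) for r in rollouts_per_step]

# Placeholder predictions; in practice these would come from the model.
predicted = [0.5, 0.5, 0.5]
loss = step_value_mse(predicted, targets)
print(targets, loss)
```

In this sketch every step contributes its own error term, which is the sense in which the supervision is fine-grained: a step whose rollouts mostly fail pulls its predicted value down even if other steps in the same solution look promising.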

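This work addresses the shortcomings of large language models on complex reasoning tasks by proposing a novel reinforcement learning framework, Direct Value Optimization (DVO). By using value signals at each reasoning step, DVO markedly improves model performance while requiring fewer training steps, demonstrating its advantage when explicit human preference information is unavailable.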

Direct Value Optimization: Improving Chain-of-Thought Reasoning in Large Language Models by Optimizing Values