Direct Preference Optimization (DPO) using an implicit reward model has
proven to be an effective alternative to reinforcement learning from human
feedback (RLHF) for fine-tuning preference-aligned large language models
(LLMs). However, the overall preference annotations of responses do not fully
capture the fine-grained quality of model outputs in complex multi-step
reasoning tasks, such as mathematical reasoning. To address this limitation, we
introduce a novel algorithm called Step-level Value Preference Optimization
(SVPO). Our approach employs Monte Carlo Tree Search (MCTS) to automatically
annotate step-level preferences for multi-step reasoning. Furthermore, from the
perspective of learning-to-rank, we train an explicit value model to replicate
the behavior of the implicit reward model, complementing standard preference
optimization. This value model enables the LLM to generate higher-reward
responses with minimal cost during inference. Experimental results demonstrate
that our method achieves state-of-the-art performance on both in-domain and
out-of-domain mathematical reasoning benchmarks.

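We introduce a new algorithm called Step-level Value Preference Optimization (SVPO). It uses Monte Carlo Tree Search (MCTS) to automatically annotate step-level preferences for multi-step reasoning and, from a learning-to-rank perspective, trains an explicit value model to replicate the behavior of the implicit reward model, which helps the large language model generate higher-reward responses. Experiments show that the method achieves state-of-the-art performance on both in-domain and out-of-domain mathematical reasoning benchmarks.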
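
As a rough illustration of the two ideas in the abstract above, the sketch below picks a step-level preference pair from sibling reasoning steps and scores a value model with a pairwise ranking loss. It is a minimal sketch under stated assumptions, not the SVPO implementation: the abstract does not say how pairs are selected or which loss is used, so the highest-versus-lowest selection rule, the `min_gap` threshold, the function names, and the logistic pairwise loss are all hypothetical choices made here for illustration.

```python
import math
from typing import List, Optional, Tuple

# A candidate next step and the value estimate a tree search assigned to it.
Step = Tuple[str, float]


def pick_step_pair(siblings: List[Step], min_gap: float = 0.1) -> Optional[Tuple[str, str]]:
    """Return (preferred_step, dispreferred_step) among steps sharing one parent.

    Assumption: the step with the highest search value is preferred and the
    one with the lowest is dispreferred; pairs whose values differ by less
    than `min_gap` are skipped as uninformative.
    """
    if len(siblings) < 2:
        return None
    best = max(siblings, key=lambda s: s[1])
    worst = min(siblings, key=lambda s: s[1])
    if best[1] - worst[1] < min_gap:
        return None
    return best[0], worst[0]


def pairwise_ranking_loss(v_preferred: float, v_dispreferred: float) -> float:
    """Logistic pairwise ranking loss: -log(sigmoid(v_preferred - v_dispreferred))."""
    return math.log1p(math.exp(-(v_preferred - v_dispreferred)))


siblings = [("x = 4", 0.9), ("x = -4", 0.2), ("x = 2", 0.5)]
pair = pick_step_pair(siblings)
print(pair)
print(pairwise_ranking_loss(2.0, 0.0))
```

The loss is small when the value model ranks the preferred step above the dispreferred one and grows as the ranking is inverted, which is the sense in which a value model can be trained "from the perspective of learning-to-rank".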

Step-level Value Preference Optimization for Mathematical Reasoning

Human alignment in large language models (LLMs) is an active area of
research. Direct preference optimization (DPO), a recent groundbreaking method,
greatly simplifies the reinforcement learning from human feedback (RLHF)
pipeline by bypassing its reward learning stage. DPO,
after training, provides an implicit reward model. In this work, we make a
novel observation that this implicit reward model can by itself be used in a
bootstrapping fashion to further align the LLM. Our approach is to use the
rewards from the current LLM to construct a preference dataset, which is
then used in subsequent DPO rounds. We incorporate refinements that debias the
length of the responses and improve the quality of the preference dataset to
further improve our approach. Our approach, named self-alignment with DPO
ImpliCit rEwards (DICE), shows great improvements in alignment and outperforms
Gemini Pro on AlpacaEval 2, reaching a 27.55%
length-controlled win rate against GPT-4 Turbo, but with only 8B parameters and
no external feedback. Our code is available at this https URL

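Using the implicit reward model that direct preference optimization (DPO) provides, we propose a self-alignment method named self-alignment with DPO ImpliCit rEwards (DICE), which improves the alignment performance and quality of large language models.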
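
The bootstrapping idea above can be sketched in a few lines. In DPO, the implicit reward of a response is, up to a term that depends only on the prompt, beta times the log-ratio of the policy and reference probabilities; that prompt-only term cancels when responses to the same prompt are compared. The sketch below assumes sequence log-probabilities are already available, and the function names, the candidate tuple layout, and the highest-versus-lowest pairing rule are hypothetical. The length debiasing and dataset-quality refinements mentioned in the abstract are not specified there and are omitted.

```python
from typing import List, Tuple

# (response_text, log-prob under current policy, log-prob under reference model)
Candidate = Tuple[str, float, float]


def implicit_reward(logp_policy: float, logp_ref: float, beta: float = 0.1) -> float:
    """DPO implicit reward, ignoring the prompt-only term that cancels in comparisons."""
    return beta * (logp_policy - logp_ref)


def build_preference_pair(candidates: List[Candidate], beta: float = 0.1) -> Tuple[str, str]:
    """Return (chosen, rejected) responses for one prompt.

    Assumption: the highest-reward sample is 'chosen' and the lowest-reward
    sample is 'rejected'; the resulting pairs would feed the next DPO round.
    """
    if len(candidates) < 2:
        raise ValueError("need at least two sampled responses per prompt")
    ranked = sorted(candidates, key=lambda c: implicit_reward(c[1], c[2], beta))
    return ranked[-1][0], ranked[0][0]


samples = [
    ("response A", -12.0, -10.0),
    ("response B", -8.0, -11.0),
    ("response C", -9.0, -9.5),
]
chosen, rejected = build_preference_pair(samples)
print(chosen, "|", rejected)
```

Because only differences between rewards for the same prompt matter, no separate reward model or external feedback is needed to rank the samples.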