Common self-improvement approaches for large language models (LLMs), such as
STaR (Zelikman et al., 2022), iteratively fine-tune LLMs on self-generated
solutions to improve their problem-solving ability. However, these approaches
discard the large number of incorrect solutions generated during this process,
potentially neglecting valuable information in such solutions. To address this
shortcoming, we propose V-STaR, which uses both the correct and incorrect
solutions generated during the self-improvement process to train a verifier
with Direct Preference Optimization (DPO) that judges the correctness of
model-generated solutions. This verifier is used at inference time to select
one solution among many candidate solutions. Running V-STaR for multiple
iterations results in progressively
better reasoners and verifiers, delivering a 4% to 17% test accuracy
improvement over existing self-improvement and verification approaches on
common code generation and math reasoning benchmarks with LLaMA2 models.
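
The two roles of the verifier described above, being trained on correct and
incorrect solutions and then selecting one candidate at inference time, can be
sketched as follows. This is a minimal illustration under stated assumptions,
not the authors' implementation: the function names, the `is_correct` and
`verifier_score` callables, the `prompt`/`chosen`/`rejected` field names, and
the choice to pair every correct solution with every incorrect one are all
hypothetical. The abstract states only that both kinds of solutions are used
to train a verifier with DPO.

```python
from itertools import product
from typing import Callable, Dict, List


def build_preference_pairs(
    problem: str,
    solutions: List[str],
    is_correct: Callable[[str, str], bool],
) -> List[Dict[str, str]]:
    """Turn self-generated solutions into preference pairs for a DPO-style verifier.

    Assumption: each correct solution is paired with each incorrect one,
    with the correct solution as the preferred ("chosen") response.
    """
    correct = [s for s in solutions if is_correct(problem, s)]
    incorrect = [s for s in solutions if not is_correct(problem, s)]
    return [
        {"prompt": problem, "chosen": good, "rejected": bad}
        for good, bad in product(correct, incorrect)
    ]


def select_best(
    problem: str,
    candidates: List[str],
    verifier_score: Callable[[str, str], float],
) -> str:
    """Pick the candidate that the verifier scores highest (best-of-k selection).

    `verifier_score` is a hypothetical stand-in for the trained verifier.
    """
    if not candidates:
        raise ValueError("need at least one candidate solution")
    return max(candidates, key=lambda s: verifier_score(problem, s))
```

Incorrect solutions are kept here as the rejected side of each pair, so a
problem contributes training signal to the verifier only when it has at least
one correct and one incorrect sample.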

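In summary, V-STaR uses both the correct and incorrect solutions generated
during self-improvement to train a verifier with DPO that judges the
correctness of model-generated solutions; at inference time, the verifier
selects one solution from many candidates. Running V-STaR for multiple
iterations progressively improves both the reasoner and the verifier, raising
test accuracy by 4% to 17% over existing self-improvement and verification
approaches on common code generation and math reasoning benchmarks.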