Preference optimization methods have been successfully applied to improve not
only the alignment of large language models (LLMs) with human values, but also
specific natural language tasks such as summarization and stylistic
continuations. This paper proposes applying preference optimization methods to
Chain-of-Thought steps in order to improve the reasoning performance of
language models. While the chosen answers are obtained from datasets that
include reasoning traces, we propose two complementary schemes for generating
rejected answers: digit corruption, and weak LLM prompting. Our approach leads
to increased accuracy on the GSM8K, AQuA-RAT, and ARC benchmarks for
Falcon2-11B and Mistral-7B. For example, the approach yields a relative
accuracy increase of up to 8.47% on the GSM8K benchmark without any extra
annotations. This work suggests that spending resources on creating more
datasets of reasoning traces would further boost LLM performance on informal
reasoning tasks.

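This paper proposes using preference optimization methods to improve the reasoning performance of language models: applying these methods to Chain-of-Thought steps improves model performance on reasoning tasks. Drawing on datasets of reasoning traces, we propose two complementary schemes for generating rejected answers: digit corruption and weak LLM prompting. The approach increases accuracy on the GSM8K, AQuA-RAT, and ARC benchmarks for Falcon2-11B and Mistral-7B; for example, accuracy on GSM8K improves by up to a relative 8.47% without any extra annotations. This work suggests that creating more datasets of reasoning traces would further boost language model performance on reasoning tasks.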
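
The digit corruption scheme is only named here, so the following is a minimal sketch of one way it could work, not the authors' procedure. It assumes that a rejected answer is built by replacing a random subset of digits in the chosen reasoning trace with different digits. The function names, the `rate` parameter, and the `prompt`/`chosen`/`rejected` record layout are all illustrative assumptions.

```python
import random
import string


def corrupt_digits(text, rate=0.3, seed=None):
    """Return a copy of `text` with some ASCII digits replaced by different digits.

    Hypothetical helper: `rate` is the assumed fraction of digits to corrupt.
    At least one digit is changed whenever the text contains any digit.
    """
    rng = random.Random(seed)
    positions = [i for i, ch in enumerate(text) if ch in string.digits]
    if not positions:
        return text  # nothing to corrupt

    # Corrupt at least one digit, and never more than are available.
    k = min(len(positions), max(1, round(rate * len(positions))))
    chars = list(text)
    for i in rng.sample(positions, k):
        # Pick a replacement that is guaranteed to differ from the original.
        alternatives = [d for d in string.digits if d != chars[i]]
        chars[i] = rng.choice(alternatives)
    return "".join(chars)


def make_preference_pair(prompt, chosen, rate=0.3, seed=None):
    """Build one preference record; the field names are an assumed convention."""
    return {
        "prompt": prompt,
        "chosen": chosen,
        "rejected": corrupt_digits(chosen, rate=rate, seed=seed),
    }
```

Calling `make_preference_pair(question, reference_trace)` on each example of a reasoning-trace dataset would give chosen/rejected pairs without extra annotation. Because only digits change, the rejected trace keeps the wording of the chosen one but carries inconsistent arithmetic.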