Recent work has shown the immense potential of synthetically generated
datasets for training large language models (LLMs), especially for acquiring
targeted skills. Current large-scale math instruction tuning datasets such as
MetaMathQA (Yu et al., 2024) and MAmmoTH (Yue et al., 2024) are constructed
using outputs from closed-source LLMs with commercially restrictive licenses. A
key reason limiting the use of open-source LLMs in these data generation
pipelines has been the wide gap between the mathematical skills of the best
closed-source LLMs, such as GPT-4, and the best open-source LLMs. Building on
the recent progress in open-source LLMs, our proposed prompting novelty, and
some brute-force scaling, we construct OpenMathInstruct-1, a math instruction
tuning dataset with 1.8M problem-solution pairs. The dataset is constructed by
synthesizing code-interpreter solutions for GSM8K and MATH, two popular math
reasoning benchmarks, using the recently released and permissively licensed
Mixtral model. Our best model, OpenMath-CodeLlama-70B, trained on a subset of
OpenMathInstruct-1, achieves a score of 84.6% on GSM8K and 50.7% on MATH, which
is competitive with the best GPT-distilled models. We release our code, models,
and the OpenMathInstruct-1 dataset under a commercially permissive license.
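
The abstract does not say how synthesized solutions are screened, so the following is only a minimal sketch under stated assumptions. It assumes that several candidate solutions are sampled per benchmark problem and that a candidate is kept only if its final answer matches the reference answer. The names `synthesize_pairs` and `sample_solution` are hypothetical; `sample_solution` stands in for a call to the generating LLM and is not part of any released code.

```python
from typing import Callable, Dict, List


def synthesize_pairs(
    problems: List[Dict[str, str]],
    sample_solution: Callable[[str], Dict[str, str]],
    samples_per_problem: int = 4,
) -> List[Dict[str, str]]:
    """Collect problem-solution pairs whose answer matches the reference.

    ASSUMPTION: answer matching and de-duplication are illustrative choices,
    not a description of the authors' pipeline.
    """
    kept: List[Dict[str, str]] = []
    for problem in problems:
        seen = set()  # solutions already kept for this problem
        for _ in range(samples_per_problem):
            candidate = sample_solution(problem["question"])
            # Discard candidates whose final answer disagrees with the reference.
            if candidate["answer"].strip() != problem["answer"].strip():
                continue
            # Discard exact duplicates of an already kept solution.
            if candidate["solution"] in seen:
                continue
            seen.add(candidate["solution"])
            kept.append(
                {"question": problem["question"], "solution": candidate["solution"]}
            )
    return kept


def make_fake_sampler() -> Callable[[str], Dict[str, str]]:
    """Return a deterministic stand-in for an LLM (hypothetical, for the demo)."""
    canned = iter(
        [
            {"solution": "print(2 + 3)", "answer": "5"},
            {"solution": "print(2 * 3)", "answer": "6"},  # wrong answer
            {"solution": "print(2 + 3)", "answer": "5"},  # duplicate
            {"solution": "print(3 + 2)", "answer": "5"},
        ]
    )
    return lambda question: next(canned)


problems = [{"question": "What is 2 + 3?", "answer": "5"}]
pairs = synthesize_pairs(problems, make_fake_sampler(), samples_per_problem=4)
```

In this toy run, the wrong-answer candidate and the exact duplicate are dropped, leaving two distinct solutions for the single problem.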

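Synthetic datasets have shown immense potential for training large language
models (LLMs), especially for acquiring targeted skills. Building on recent
progress in open-source LLMs, a prompting innovation, and some brute-force
scaling, this work constructs OpenMathInstruct-1, a math instruction tuning
dataset with 1.8M problem-solution pairs. Models trained on it achieve results
on GSM8K and MATH, two popular math reasoning benchmarks, that are competitive
with the best GPT-distilled models. The code, models, and the
OpenMathInstruct-1 dataset are released under a commercially permissive
license.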

OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset

In the context of multi-step reasoning, language model (LM) probabilities
are often miscalibrated: solutions with high probabilities are not always
correct. Therefore, greedy decoding, which is the standard decoding method for
reasoning tasks, often yields incorrect solutions. In addition, methods such as
self-consistency and verifiers rely on sampling from the LM distribution and do
not tackle the underlying issue. To address this, we introduce Guiding
Multi-step ReAsoning with a CorrectnEss Discriminator (GRACE), a stepwise
decoding approach that nudges the model towards producing correct reasoning
steps. GRACE employs a discriminator model, which is trained to differentiate
correct steps from invalid ones, to adjust decoding preferences based on the
correctness of each reasoning step. Importantly, GRACE does not require
fine-tuning or re-training the LMs. When compared with conventional decoding
strategies over four popular math reasoning benchmarks, GRACE exhibits
significant improvements in both final answer accuracy and step correctness,
outperforming both greedy decoding and self-consistency. Our code can
be found at https://github.com/mukhal/grace.
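
The stepwise, discriminator-guided decoding described above can be illustrated with a minimal sketch. It assumes that, at each step, the LM proposes a few candidate next steps with log-probabilities and that each candidate is ranked by a weighted sum of its LM log-probability and a discriminator score. The weight `beta`, the names `guided_decode`, `propose_steps`, and `discriminator`, and the `Answer:` stop marker are all hypothetical choices for illustration; the abstract does not specify the exact scoring rule.

```python
from typing import Callable, List, Tuple

Candidate = Tuple[str, float]  # (step text, LM log-probability)


def guided_decode(
    question: str,
    propose_steps: Callable[[str, List[str]], List[Candidate]],
    discriminator: Callable[[str, List[str], str], float],
    beta: float = 0.5,
    max_steps: int = 8,
    stop_marker: str = "Answer:",
) -> List[str]:
    """Build a solution one step at a time, preferring steps the discriminator favors.

    ASSUMPTION: the score (1 - beta) * log p_LM + beta * discriminator is an
    illustrative combination, not a confirmed description of GRACE.
    """
    steps: List[str] = []
    for _ in range(max_steps):
        candidates = propose_steps(question, steps)
        if not candidates:
            break
        best_step, _ = max(
            candidates,
            key=lambda c: (1.0 - beta) * c[1]
            + beta * discriminator(question, steps, c[0]),
        )
        steps.append(best_step)
        if stop_marker in best_step:
            break
    return steps


# Toy stand-ins (hypothetical): the "LM" prefers a wrong first step.
def toy_propose(question: str, prefix: List[str]) -> List[Candidate]:
    if not prefix:
        return [("3 * 4 = 12", -0.916), ("3 * 4 = 7", -0.511)]
    if prefix[-1] == "3 * 4 = 7":
        return [("Answer: 7", -0.1)]
    return [("Answer: 12", -0.1)]


CORRECT_STEPS = {"3 * 4 = 12", "Answer: 12"}


def toy_discriminator(question: str, prefix: List[str], step: str) -> float:
    # Penalize steps that are not known to be correct.
    return 0.0 if step in CORRECT_STEPS else -5.0


guided = guided_decode("What is 3 * 4?", toy_propose, toy_discriminator, beta=0.5)
lm_only = guided_decode("What is 3 * 4?", toy_propose, toy_discriminator, beta=0.0)
```

With `beta=0.0` the toy LM's preferred but incorrect first step is chosen, which mirrors the miscalibration problem. With `beta=0.5` the discriminator penalty steers decoding to the correct step without any change to the LM itself.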

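This work proposes Guiding Multi-step ReAsoning with a CorrectnEss
Discriminator (GRACE), a stepwise decoding method that uses a discriminator
model to adjust the LM's decoding preferences and thereby improve the accuracy
of multi-step reasoning. Compared with conventional decoding strategies, GRACE
shows significant improvements on four popular math reasoning benchmarks.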