Mathematical word problem-solving has long been recognized as a complex task
for small language models (SLMs). A recent study hypothesized that the smallest
model size needed to achieve over 80% accuracy on the GSM8K benchmark is 34
billion parameters. To reach this level of performance with smaller models,
researchers often train SLMs to generate Python code or use tools to help avoid
calculation errors. Additionally, they employ ensembling, where outputs of up
to 100 model runs are combined to arrive at a more accurate result. Result
selection is done using consensus, majority vote, or a separate verifier model
used in conjunction with the SLM. Ensembling provides a substantial boost in
accuracy but at a significant cost increase with multiple calls to the model
(e.g., Phi-GSM uses top-48 to boost the performance from 68.2 to 81.5).
In this work, we present Orca-Math, a 7-billion-parameter SLM based on
Mistral-7B, which achieves 86.81% on GSM8K without the need for multiple model
calls or the use of verifiers, code execution or any other external tools. Our
approach has the following key elements: (1) a high-quality synthetic dataset
of 200K math problems created using a multi-agent setup where agents
collaborate to create the data, and (2) an iterative learning technique that
enables the SLM to practice solving problems, receive feedback on its solutions
and learn from preference pairs incorporating the SLM solutions and the
feedback. When trained with Supervised Fine-Tuning alone, Orca-Math achieves
81.50% on the GSM8K pass@1 metric. With iterative preference learning, Orca-Math
achieves 86.81% pass@1. Orca-Math surpasses the performance of significantly
larger models such as LLAMA-2-70B, WizardMath-70B, Gemini-Pro, and ChatGPT-3.5.
It also significantly outperforms other smaller models while using much less
data (hundreds of thousands vs. millions of problems).
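
The majority-vote form of ensembling mentioned above can be sketched as
follows. This is only an illustration of the general idea, not the procedure
used by Phi-GSM or any other cited system; the function name `majority_vote`
and the sample answers are hypothetical, and the final answers are assumed to
have already been extracted from each model run as strings.

```python
from collections import Counter


def majority_vote(answers):
    """Pick the most frequent final answer among independent model runs.

    `answers` holds one extracted final answer (a string) per run.
    Ties are resolved in favor of the answer that was seen first.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    # most_common(1) returns a list with one (answer, count) pair.
    return counts.most_common(1)[0][0]


# Hypothetical answers from five runs on the same problem.
sampled = ["18", "20", "18", "18", "21"]
selected = majority_vote(sampled)
```

Each entry in `sampled` costs one model call, which is the cost increase the
abstract refers to: a top-48 ensemble needs 48 calls per problem.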

Orca-Math is a 7-billion-parameter SLM based on Mistral-7B that achieves 86.81% accuracy on GSM8K without multiple model calls or the use of verifiers, code execution or other external tools.

Orca-Math: Unlocking the potential of SLMs in Grade School Math

Large language models such as GPT-3 and PaLM have shown remarkable
performance in few-shot learning. However, they still struggle with reasoning
tasks such as the arithmetic benchmark GSM8K. Recent advances deliberately
guide the language model to generate a chain of reasoning steps before
producing the final answer, successfully boosting the problem-solving rate on
GSM8K from 17.9% to 58.1%. In this paper, we propose a
new approach, DiVeRSe (Diverse Verifier on Reasoning Step), to further advance
their reasoning capability. DiVeRSe first explores different prompts to enhance
the diversity in reasoning paths. Second, DiVeRSe introduces a verifier to
distinguish good answers from bad answers for better weighted voting.
Finally, DiVeRSe verifies the correctness of each single step rather than all
the steps as a whole. We conduct extensive experiments using the latest
language model code-davinci-002 and demonstrate that DiVeRSe can achieve new
state-of-the-art performance on six out of eight reasoning benchmarks (e.g.,
GSM8K 74.4% to 83.2%), outperforming the PaLM model with 540B parameters.
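
The verifier-weighted voting described above can be sketched as follows. This
is a minimal illustration under stated assumptions, not the DiVeRSe
implementation: the function name `weighted_vote` is hypothetical, and the
verifier is assumed to have already assigned each sampled reasoning path a
score in [0, 1]. The per-step verification mentioned in the abstract is not
modeled here.

```python
from collections import defaultdict


def weighted_vote(candidates):
    """Select the answer with the largest total verifier score.

    `candidates` is a list of (answer, score) pairs, one per sampled
    reasoning path. The scores are assumed to come from a separate
    verifier model, which is not part of this sketch.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    totals = defaultdict(float)
    for answer, score in candidates:
        totals[answer] += score  # sum the scores of paths that agree
    return max(totals, key=totals.get)


# Hypothetical scores: two low-confidence paths agree on "12",
# while one high-confidence path gives "15".
paths = [("12", 0.2), ("12", 0.3), ("15", 0.9)]
chosen = weighted_vote(paths)
```

With these made-up scores a plain majority vote would pick "12", while the
weighted vote picks "15", which shows how a verifier can override the most
frequent answer.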

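This paper introduces DiVeRSe, a method that further improves the reasoning ability of large language models by increasing prompt diversity and introducing a verifier, achieving new state-of-the-art performance on six of eight benchmarks, including GSM8K.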

On the Advance of Making Language Models Better Reasoners

We explore how generating a chain of thought -- a series of intermediate
reasoning steps -- significantly improves the ability of large language models
to perform complex reasoning. In particular, we show how such reasoning
abilities emerge naturally in sufficiently large language models via a simple
method called chain of thought prompting, where a few chain of thought
demonstrations are provided as exemplars in prompting. Experiments on three
large language models show that chain of thought prompting improves performance
on a range of arithmetic, commonsense, and symbolic reasoning tasks. The
empirical gains can be striking. For instance, prompting a 540B-parameter
language model with just eight chain of thought exemplars achieves
state-of-the-art accuracy on the GSM8K benchmark of math word problems,
surpassing even finetuned GPT-3 with a verifier.
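
A few-shot chain of thought prompt of the kind described above can be
assembled as follows. This is a minimal sketch: the function name
`build_cot_prompt`, the "Q:"/"A:" template and the single exemplar are
assumptions made for illustration, not the exemplars or exact format used in
the experiments.

```python
def build_cot_prompt(exemplars, question):
    """Build a few-shot prompt whose exemplars include reasoning steps.

    `exemplars` is a list of (question, reasoning, answer) triples.
    The new question is appended with an open "A:" so that the model
    continues with its own reasoning before giving an answer.
    """
    parts = []
    for q, reasoning, answer in exemplars:
        parts.append(f"Q: {q}\nA: {reasoning} The answer is {answer}.")
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


# One made-up exemplar; the abstract reports using eight.
exemplars = [
    (
        "Tom has 3 apples and buys 2 more. How many apples does he have?",
        "Tom starts with 3 apples. 3 + 2 = 5.",
        "5",
    ),
]
prompt = build_cot_prompt(exemplars, "A box holds 4 pens. How many pens are in 6 boxes?")
```

The only difference from standard few-shot prompting is that each exemplar
answer spells out the intermediate steps instead of giving the final answer
alone.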

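With chain of thought prompting, providing a few chain of thought exemplars to a large language model significantly improves its performance on a range of arithmetic, commonsense, and symbolic reasoning tasks, even surpassing finetuned GPT-3 with a verifier on GSM8K.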