Supervised fine-tuning enhances the problem-solving abilities of language
models across various mathematical reasoning tasks. To maximize such benefits,
existing research focuses on broadening the training set with various data
augmentation techniques, which is effective for standard single-round
question-answering settings. Our work introduces a novel technique aimed at
cultivating a deeper understanding of the training problems at hand, enhancing
performance not only in standard settings but also in more complex scenarios
that require reflective thinking. Specifically, we propose reflective
augmentation, a method that embeds problem reflection into each training
instance. It trains the model to consider alternative perspectives and engage
with abstractions and analogies, thereby fostering a thorough comprehension
through reflective reasoning. Extensive experiments validate the achievement of
our aim, underscoring the unique advantages of our method and its complementary
nature relative to existing augmentation techniques.
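
As a rough illustration of what embedding reflection into a training instance could look like, the sketch below appends a reflection section to the original answer, so that the fine-tuning target covers both. The field names, the section layout, and the `augment_with_reflection` helper are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class TrainingInstance:
    question: str
    answer: str


def augment_with_reflection(instance, alternative, abstraction):
    """Return a new instance whose target is the answer plus a reflection section.

    The layout (header text, ordering) is an assumption made for illustration.
    """
    reflection = (
        "Reflection:\n"
        f"Alternative perspective: {alternative}\n"
        f"Abstraction or analogy: {abstraction}"
    )
    return TrainingInstance(
        question=instance.question,  # the input side is left unchanged
        answer=f"{instance.answer}\n\n{reflection}",
    )


original = TrainingInstance(
    question="A shirt costs $20 after a 20% discount. What was the original price?",
    answer="Let p be the original price. 0.8 * p = 20, so p = 25. The answer is 25.",
)
augmented = augment_with_reflection(
    original,
    alternative="Check backwards: 20% of 25 is 5, and 25 - 5 = 20.",
    abstraction="In general, a price s after a discount rate r came from s / (1 - r).",
)
print(augmented.answer)
```

Only the target side grows here; the question is untouched, which is one way to read "embeds problem reflection into each training instance" without enlarging the set of problems.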

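Supervised fine-tuning enhances the problem-solving abilities of language models across various mathematical reasoning tasks. Our work introduces a new technique, reflective augmentation, which embeds problem reflection to cultivate a deeper understanding of the problems, improving performance not only in standard settings but also in complex scenarios that require reflective thinking.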

Learn Beyond The Answer: Training Language Models with Reflection for Mathematical Reasoning

This paper investigates the performance of Large Language Models (LLMs) and
Tool-augmented LLMs in tackling complex mathematical reasoning tasks. We
introduce IMP-TIP: Improving Math Reasoning with Tool-augmented Interleaf
Prompting, a framework that combines the strengths of both LLMs and
Tool-augmented LLMs. IMP-TIP follows the "From Good to Great" concept,
collecting multiple potential solutions from both LLMs and their Tool-Augmented
counterparts for the same math problem, and then selecting or re-generating the
most accurate answer after cross-checking these solutions via tool-augmented
interleaf prompting. The framework incorporates two key aspects: self-prompt
and tool-augmented interleaf prompting (TIP). The former allows LLMs to
autonomously refine and improve an initial prompt related to tool usage, while
the latter enables LLMs to derive the final answer by dynamically analyzing the
problem, cross-checking potential solutions, and revising previous reasoning
hints in an interleaved manner. Experimental analysis shows that IMP-TIP
achieves enhanced mathematical capabilities and outperforms traditional LLMs
and tool-augmented LLMs in accuracy and reasoning diversity on math reasoning
tasks. For instance, IMP-TIP can improve Tool-augmented ChatGPT on GSM8K-Hard
from 56.0% to 65.2%.
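
The collect-then-cross-check idea can be sketched in a heavily simplified form. The abstract describes cross-checking done by the LLM itself through interleaved prompting; the snippet below swaps in plain majority voting as a stand-in, and the `cross_check` function and its return convention are assumptions made for illustration only.

```python
from collections import Counter


def cross_check(llm_answers, tool_answers):
    """Pick the answer most often supported across both candidate pools.

    Returns (answer, needs_regeneration). Majority voting is a stand-in
    for the LLM-driven cross-checking described in the abstract.
    """
    candidates = [round(a, 6) for a in llm_answers + tool_answers]
    if not candidates:
        return None, True
    answer, votes = Counter(candidates).most_common(1)[0]
    # Without a strict majority, flag the problem for re-generation.
    return answer, votes <= len(candidates) / 2


# Candidates from a plain LLM and from its tool-augmented counterpart.
selected, regenerate = cross_check([18.0, 20.0], [18.0, 18.0])
print(selected, regenerate)
```

When the two pools disagree without a clear winner, the second return value signals that the answer should be re-generated rather than selected.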

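The IMP-TIP framework combines the strengths of large language models (LLMs) and tool-augmented LLMs, improving performance on complex mathematical reasoning tasks by collecting and cross-checking multiple potential solutions. Experiments show that IMP-TIP has enhanced capabilities on math reasoning tasks and outperforms both traditional LLMs and tool-augmented LLMs in accuracy and reasoning diversity.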

From Good to Great: Improving Math Reasoning with Tool-Augmented Interleaf Prompting

Large language models (LLMs) have recently demonstrated an impressive ability
to perform arithmetic and symbolic reasoning tasks when provided with a few
examples at test time ("few-shot prompting"). Much of this success can be
attributed to prompting methods such as "chain-of-thought", which employ LLMs
both for understanding the problem description by decomposing it into steps and
for solving each step of the problem. While LLMs seem to be adept at this
sort of step-by-step decomposition, LLMs often make logical and arithmetic
mistakes in the solution part, even when the problem is decomposed correctly.
In this paper, we present Program-Aided Language models (PAL): a novel approach
that uses the LLM to read natural language problems and generate programs as
the intermediate reasoning steps, but offloads the solution step to a runtime
such as a Python interpreter. With PAL, decomposing the natural language
problem into runnable steps remains the only learning task for the LLM, while
solving is delegated to the interpreter. We demonstrate this synergy between a
neural LLM and a symbolic interpreter across 13 mathematical, symbolic, and
algorithmic reasoning tasks from BIG-Bench Hard and other benchmarks. In all
these natural language reasoning tasks, generating code using an LLM and
reasoning using a Python interpreter leads to more accurate results than much
larger models. For example, PAL using Codex achieves state-of-the-art few-shot
accuracy on the GSM8K benchmark of math word problems, surpassing PaLM-540B
with chain-of-thought by an absolute 15% in top-1 accuracy. Our code and data
are publicly available at this http URL.
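
The division of labor described above, where the model writes a program and an interpreter computes the result, can be sketched as follows. The program string is a hand-written stand-in for model output, and both the `run_generated_program` helper and the convention of binding the result to a variable named `answer` are assumptions for illustration, not the paper's actual interface.

```python
def run_generated_program(program: str):
    """Execute model-written Python and return the value bound to `answer`.

    exec is used for brevity; untrusted code should be run in a sandbox.
    """
    namespace = {}
    exec(program, namespace)
    return namespace["answer"]


# Hand-written stand-in for what an LLM might produce for the problem:
# "Olivia has $23. She buys five bagels at $3 each. How much money is left?"
generated_program = """
money_initial = 23
bagels = 5
bagel_cost = 3
money_spent = bagels * bagel_cost
answer = money_initial - money_spent
"""

result = run_generated_program(generated_program)
print(result)  # prints 8
```

The model is only responsible for the decomposition into named steps; the arithmetic itself is done by the interpreter, which is where chain-of-thought outputs tend to slip.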

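This paper presents a novel approach that uses large language models to read natural language problems and generate programs as intermediate reasoning steps, while delegating the solution step to a runtime such as a Python interpreter. It demonstrates the synergy between a neural LLM and a symbolic interpreter across 13 mathematical, symbolic, and algorithmic reasoning tasks.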