Calibration, which establishes the correlation between accuracy and model
confidence, is important for LLM development. We design three off-the-shelf
calibration methods based on self-consistency (Wang et al., 2022) for math
reasoning tasks. Evaluated on two popular benchmarks (GSM8K and MathQA) using
strong open-source LLMs (Mistral and LLaMA2), our methods better bridge model
confidence and accuracy than existing methods based on p(True) (Kadavath et
al., 2022) or logit (Kadavath et al., 2022).
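
As a rough illustration of how self-consistency can yield a confidence score, the sketch below samples several final answers for one question, takes the majority answer, and uses its vote share as the confidence. This is one plausible estimator, not necessarily any of the three methods the paper designs, which the abstract does not spell out. The function names and the binned expected calibration error (ECE) metric are assumptions added for illustration.

```python
from collections import Counter


def self_consistency_confidence(answers):
    """Return (majority_answer, confidence) for one question.

    `answers` holds the final answers extracted from several sampled
    reasoning paths. Confidence is the majority answer's vote share
    (an assumed estimator, not taken from the paper).
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    majority, votes = Counter(answers).most_common(1)[0]
    return majority, votes / len(answers)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += len(bucket) / total * abs(accuracy - mean_conf)
    return ece


# Hypothetical sampled answers for a single question.
answer, confidence = self_consistency_confidence(["18", "18", "20", "18"])
```

A lower ECE means that confidence tracks accuracy more closely, which is the sense in which a calibration method "bridges" the two.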

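We design three off-the-shelf calibration methods based on self-consistency for math reasoning tasks. Evaluated with open-source LLMs (Mistral and LLaMA2) on two popular benchmarks, GSM8K and MathQA, our methods link model confidence and accuracy more closely than existing methods based on p(True) or logit.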

Self-Consistency Boosts Calibration for Math Reasoning

Large language models (LLMs) are displaying emergent abilities for math
reasoning tasks, and there is growing attention on enhancing the ability of
open-source LLMs through supervised fine-tuning (SFT). In this paper, we aim to
explore a general data strategy for supervised data to help optimize and expand
math reasoning ability. Firstly, we determine the ability boundary of reasoning
path augmentation by identifying these paths' minimal optimal set. Secondly, we
validate that different abilities of the model can be cumulatively enhanced by
a Mix of Minimal Optimal Sets (MMOS) of the corresponding types of data, and
our MMOS models achieve SOTA performance on a series of base models at much
lower construction cost. In addition, we point out that GSM-HARD is not really
hard and that today's LLMs no longer lack numerical robustness. We also provide
an Auto Problem Generator for robustness testing and educational applications.
Our code and data are publicly available at this https URL

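The ability boundary of reasoning path augmentation is determined by identifying the optimal set of reasoning paths. Mixing the optimal sets of different types of data cumulatively enhances different abilities of the model and achieves SOTA performance at lower construction cost. An Auto Problem Generator is also provided for robustness testing and educational applications.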

An Empirical Study of Data Ability Boundary in LLMs' Math Reasoning

As large language models (LLMs) have shown effectiveness with different
prompting methods, such as Chain of Thought and Program of Thought, we find
that these methods complement each other well on math reasoning tasks. In this
work, we propose XoT, an integrated problem solving framework that prompts
LLMs with diverse reasoning thoughts. For each question, XoT begins by
selecting the most suitable method and then executes each
method iteratively. Within each iteration, XoT actively checks the validity of
the generated answer and incorporates the feedback from external executors,
allowing it to dynamically switch among different prompting methods. Through
extensive experiments on 10 popular math reasoning datasets, we demonstrate the
effectiveness of our proposed approach and thoroughly analyze the strengths of
each module. Moreover, empirical results suggest that our framework is
orthogonal to recent work that makes improvements on single reasoning methods
and can further generalise to the logical reasoning domain. By allowing method
switching, XoT provides a fresh perspective on the collaborative integration of
diverse reasoning thoughts in a unified framework.
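
The select, execute, check, and switch loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `solve_with_switching`, the `select` and `check` callables, and the convention that a solver returns `None` when its external executor fails are all hypothetical names and choices introduced here.

```python
from typing import Callable, Dict, List, Optional, Tuple

Solver = Callable[[str], Optional[str]]       # returns None if execution fails
Selector = Callable[[str, List[str]], str]    # picks one method name
Checker = Callable[[str, str], bool]          # validates a candidate answer


def solve_with_switching(
    question: str,
    methods: Dict[str, Solver],
    select: Selector,
    check: Checker,
    max_iters: int = 3,
) -> Tuple[Optional[str], Optional[str]]:
    """Try prompting methods one at a time, switching when a check fails.

    Returns (method_name, answer) for the first answer that passes the
    check, or (None, None) if every attempted method is rejected.
    """
    remaining = list(methods)
    for _ in range(min(max_iters, len(methods))):
        # Planning step: choose the most suitable method still available.
        name = select(question, remaining)
        answer = methods[name](question)
        # Verification step: accept only answers that pass the check.
        if answer is not None and check(question, answer):
            return name, answer
        # Feedback step: drop the failed method and switch to another.
        remaining.remove(name)
    return None, None


# Toy stand-ins for LLM calls (hypothetical, for illustration only).
toy_methods = {"cot": lambda q: "5", "pot": lambda q: "4"}
result = solve_with_switching(
    "What is 2 + 2?",
    toy_methods,
    select=lambda q, names: names[0],
    check=lambda q, a: a == "4",
)
```

In this toy run the first method's answer fails the check, so the loop switches to the second method, whose answer is accepted.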

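By prompting with diverse reasoning thoughts, XoT provides an integrated problem-solving framework that effectively selects the most suitable method for math reasoning tasks and dynamically switches among different prompting methods.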