Recent advancements in large language models (LLMs) have showcased
significant improvements in mathematics. However, traditional math benchmarks
like GSM8K offer a one-dimensional perspective and fall short of providing a
holistic assessment of LLMs' math capabilities. To address this gap, we
introduce MathBench, a new benchmark that rigorously assesses the mathematical
capabilities of large language models. MathBench spans a wide range of
mathematical disciplines, offering a detailed evaluation of both theoretical
understanding and practical problem-solving skills. The benchmark progresses
through five distinct stages, from basic arithmetic to college mathematics, and
is structured to evaluate models at various depths of knowledge. Each stage
includes theoretical questions and application problems, allowing us to measure
a model's mathematical proficiency and its ability to apply concepts in
practical scenarios. MathBench aims to enhance the evaluation of LLMs'
mathematical abilities, providing a nuanced view of their levels of knowledge
understanding and their problem-solving skills in a bilingual context. The
project is released at this https URL.
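
The abstract describes results broken down by stage, by question type (theory versus application), and by language. As a rough illustration of that kind of breakdown, the following minimal sketch groups graded answers and averages correctness per group. The record fields (`stage`, `kind`, `lang`, `correct`), the numeric stage labels, and the helper name `accuracy_by_group` are all hypothetical and are not taken from the MathBench release.

```python
from collections import defaultdict


def accuracy_by_group(records, keys):
    """Average the `correct` flag over records grouped by the given keys."""
    totals = defaultdict(lambda: [0, 0])  # group -> [num_correct, num_total]
    for rec in records:
        group = tuple(rec[k] for k in keys)
        totals[group][0] += int(rec["correct"])
        totals[group][1] += 1
    return {group: c / n for group, (c, n) in totals.items()}


# Hypothetical graded results; field names and stage labels are illustrative only.
records = [
    {"stage": 1, "kind": "theory", "lang": "en", "correct": True},
    {"stage": 1, "kind": "application", "lang": "en", "correct": True},
    {"stage": 1, "kind": "application", "lang": "zh", "correct": False},
    {"stage": 5, "kind": "theory", "lang": "en", "correct": False},
    {"stage": 5, "kind": "theory", "lang": "zh", "correct": True},
]

by_stage_kind = accuracy_by_group(records, ["stage", "kind"])
by_lang = accuracy_by_group(records, ["lang"])
```

Grouping on different key lists gives the per-stage, per-type, or per-language view from the same set of graded answers, which is one simple way to obtain a multi-dimensional report.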

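With the new MathBench benchmark, we can comprehensively evaluate the mathematical capabilities of large language models. It provides, for the first time, a multidimensional perspective, assessing models across stages ranging from basic arithmetic to college mathematics, with the aim of improving the evaluation of LLMs' mathematical capabilities and offering a deeper understanding of their knowledge levels and problem-solving skills.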

MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark

Mathematical capabilities were previously believed to emerge in common
language models only at a very large scale or require extensive math-related
pre-training. This paper shows that the LLaMA-2 7B model with common
pre-training already exhibits strong mathematical abilities, as evidenced by
its impressive accuracy of 97.7% and 72.0% on the GSM8K and MATH benchmarks,
respectively, when selecting the best response from 256 random generations. The
primary issue with the current base model is the difficulty in consistently
eliciting its inherent mathematical capabilities. Notably, the accuracy for the
first answer drops to 49.5% and 7.9% on the GSM8K and MATH benchmarks,
respectively. We find that simply scaling up the SFT data can significantly
enhance the reliability of generating correct answers. However, the potential
for extensive scaling is constrained by the scarcity of publicly available math
questions. To overcome this limitation, we employ synthetic data, which proves
to be nearly as effective as real data and shows no clear saturation when
scaled up to approximately one million samples. This straightforward approach
achieves an accuracy of 82.6% on GSM8K and 40.6% on MATH using LLaMA-2 7B
models, surpassing previous models by 14.2% and 20.8%, respectively. We also
provide insights into scaling behaviors across different reasoning complexities
and error types.
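
The gap between the best-of-256 accuracy and the first-answer accuracy can be illustrated with a minimal sketch. It assumes that "selecting the best response" means a problem counts as solved when at least one of the sampled answers is correct; the paper's exact selection procedure is not given in the abstract. The function names and the toy correctness matrix below are hypothetical.

```python
def first_sample_accuracy(correct):
    """Fraction of problems whose first sampled answer is correct."""
    return sum(row[0] for row in correct) / len(correct)


def best_of_n_accuracy(correct, n):
    """Fraction of problems with at least one correct answer among the first n samples."""
    return sum(any(row[:n]) for row in correct) / len(correct)


# Toy data (not from the paper): 4 problems, 4 sampled answers each.
# correct[i][j] is True when sample j for problem i matches the reference answer.
correct = [
    [True, False, True, False],
    [False, False, True, False],
    [False, False, False, False],
    [False, True, False, False],
]

first_acc = first_sample_accuracy(correct)
best_of_2 = best_of_n_accuracy(correct, 2)
best_of_4 = best_of_n_accuracy(correct, 4)
```

On this toy matrix the first-sample accuracy is lower than the best-of-4 accuracy, mirroring the abstract's point that the capability is present in the model but not reliably elicited on a single attempt.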

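By simply scaling up the data samples, the LLaMA-2 7B model demonstrates strong and reliable mathematical ability on the GSM8K and MATH benchmarks. The work also offers insights into scaling behavior across different reasoning complexities and error types.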

Common 7B Language Models Already Possess Strong Math Capabilities

We investigate the mathematical capabilities of ChatGPT by testing it on
publicly available datasets, as well as hand-crafted ones, and measuring its
performance against other models trained on a mathematical corpus, such as
Minerva. We also test whether ChatGPT can be a useful assistant to professional
mathematicians by emulating various use cases that come up in the daily
professional activities of mathematicians (question answering, theorem
searching). In contrast to formal mathematics, where large databases of formal
proofs are available (e.g., the Lean Mathematical Library), current datasets of
natural-language mathematics, used to benchmark language models, only cover
elementary mathematics. We address this issue by introducing a new dataset:
GHOSTS. It is the first natural-language dataset made and curated by working
researchers in mathematics that (1) aims to cover graduate-level mathematics
and (2) provides a holistic overview of the mathematical capabilities of
language models. We benchmark ChatGPT on GHOSTS and evaluate performance
against fine-grained criteria. We make this new dataset publicly available to
assist a community-driven comparison of ChatGPT with (future) large language
models in terms of advanced mathematical comprehension. We conclude that
contrary to many positive reports in the media (a potential case of selection
bias), ChatGPT's mathematical abilities are significantly below those of an
average mathematics graduate student. Our results show that ChatGPT often
understands the question but fails to provide correct solutions. Hence, if your
goal is to use it to pass a university exam, you would be better off copying
from your average peer!

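This study uses the GHOSTS dataset to evaluate ChatGPT's mathematical abilities against other models trained on mathematical corpora. It finds that ChatGPT's mathematical abilities are significantly below those of an average mathematics graduate student, and it highlights the importance of the GHOSTS dataset and of future comparisons of large language models on advanced mathematical comprehension.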