The rapid advancement of Large Language Models (LLMs) in the realm of
mathematical reasoning necessitates comprehensive evaluations to gauge progress
and inspire future directions. Existing assessments predominantly focus on
problem-solving from the examinee perspective, overlooking the dual perspective
of the examiner regarding error identification and correction. From the examiner
perspective, we define four evaluation tasks for error identification and
correction and introduce a new dataset annotated with error types and steps. We
also design diverse prompts to thoroughly evaluate eleven representative LLMs.
Our principal findings indicate that GPT-4 outperforms all other models, while
the open-source model LLaMA-2-7B demonstrates abilities comparable to those of
the closed-source models GPT-3.5 and Gemini Pro. Notably, calculation errors
prove the most challenging error type. Moreover, prompting LLMs with the error
types can improve the average correction accuracy by 47.9%. These results reveal
potential directions for developing the mathematical reasoning abilities of
LLMs. Our code and dataset are available at this https URL

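By defining four evaluation tasks and designing diverse prompts to thoroughly evaluate eleven representative LLMs, we provide, from the examiner's perspective, a new dataset for error identification and correction with annotated error types and steps. The results show that GPT-4 performs best among all models, while the open-source model LLaMA-2-7B has abilities comparable to the closed-source models GPT-3.5 and Gemini Pro. In particular, calculation errors prove to be the most challenging error type. In addition, prompting LLMs with the error types can improve the average correction accuracy by 47.9%. These results reveal potential directions for developing the mathematical reasoning abilities of LLMs.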

Evaluating Mathematical Reasoning of Large Language Models: A Focus on Error Identification and Correction

Given the rapid ascent of large language models (LLMs), we study the
question: (How) can large language models help in the reviewing of scientific
papers or proposals? We first conduct some pilot studies where we find that (i)
GPT-4 outperforms other LLMs (Bard, Vicuna, Koala, Alpaca, LLaMa, Dolly,
OpenAssistant, StableLM), and (ii) prompting with a specific question (e.g., to
identify errors) outperforms prompting to simply write a review. With these
insights, we study the use of LLMs (specifically, GPT-4) for three tasks:
1. Identifying errors: We construct 13 short computer science papers each
with a deliberately inserted error, and ask the LLM to check for the
correctness of these papers. We observe that the LLM finds errors in 7 of them,
spanning both mathematical and conceptual errors.
2. Verifying checklists: We task the LLM with verifying 16 closed-ended checklist
questions in the respective sections of 15 NeurIPS 2022 papers. We find that
across 119 {checklist question, paper} pairs, the LLM achieved 86.6% accuracy.
3. Choosing the "better" paper: We generate 10 pairs of abstracts,
deliberately designing each pair in such a way that one abstract was clearly
superior to the other. The LLM, however, struggled to discern these
relatively straightforward distinctions accurately, committing errors in its
evaluations for 6 out of the 10 pairs.
Based on these experiments, we think that LLMs have a promising use as
reviewing assistants for specific reviewing tasks, but not (yet) for complete
evaluations of papers or proposals.

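A study on using the GPT-4 large language model to assist with paper review finds that it can effectively identify most errors, but that it still makes some mistakes when choosing the better paper.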