The leaderboard of Large Language Models (LLMs) in mathematical tasks has
been continuously updated. However, the majority of evaluations focus solely on
the final results, neglecting the quality of the intermediate steps. This
oversight can mask underlying problems, such as logical errors or unnecessary
steps in the reasoning process. To measure reasoning beyond final-answer
accuracy, we introduce ReasonEval, a new methodology for evaluating the quality
of reasoning steps. ReasonEval characterizes reasoning quality by
$\textit{validity}$ and $\textit{redundancy}$, and uses accompanying LLMs to
assess both automatically. Instantiated by base models
that possess strong mathematical knowledge and trained with high-quality
labeled data, ReasonEval achieves state-of-the-art performance on human-labeled
datasets and can accurately detect different types of errors generated by
perturbation. When applied to evaluate LLMs specialized in math, we find that
an increase in final-answer accuracy does not necessarily guarantee an
improvement in the overall quality of the reasoning steps for challenging
mathematical problems. Additionally, we observe that ReasonEval can play a
significant role in data selection. We release the best-performing model,
meta-evaluation script, and all evaluation results at
this https URL
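
The abstract does not say how step-level judgments are turned into solution-level scores, so the following is only a minimal sketch under stated assumptions. It assumes a hypothetical evaluator that gives each reasoning step three class probabilities (positive, neutral, negative), and it assumes that a solution is only as valid as its weakest step and as redundant as its most redundant step. The names `StepScores` and `solution_scores` are illustrative and are not taken from the released code.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class StepScores:
    """Hypothetical per-step class probabilities from an evaluator model."""
    p_positive: float  # step is correct and advances the solution
    p_neutral: float   # step is correct but does not advance the solution
    p_negative: float  # step is incorrect


def step_validity(s: StepScores) -> float:
    # Assumption: a step counts as valid whenever it is not wrong.
    return s.p_positive + s.p_neutral


def step_redundancy(s: StepScores) -> float:
    # Assumption: a step is redundant when it is correct but unhelpful.
    return s.p_neutral


def solution_scores(steps: List[StepScores]) -> Tuple[float, float]:
    """Aggregate step scores into (validity, redundancy) for one solution."""
    if not steps:
        raise ValueError("a solution needs at least one step")
    validity = min(step_validity(s) for s in steps)      # weakest step
    redundancy = max(step_redundancy(s) for s in steps)  # most redundant step
    return validity, redundancy
```

Under these assumptions, a solution can reach the correct final answer while still receiving a low validity score, because one flawed intermediate step is enough to lower the minimum.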

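We propose ReasonEval, a method that assesses reasoning quality through validity and redundancy. It performs strongly on mathematical tasks, and we find that higher final-answer accuracy does not necessarily improve the overall quality of reasoning steps on challenging mathematical problems.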

Evaluating Mathematical Reasoning Beyond Accuracy

To comprehensively assess the capacity of current models for complex
reasoning, it is crucial to assess their step-by-step reasoning in a scalable
manner. Established reference-based evaluation metrics rely on human-annotated
reasoning chains to assess the model-derived chains. However, such
``gold-standard'' human-written reasoning chains may not be unique and their
acquisition is often labor-intensive. Existing reference-free reasoning metrics
eliminate the need for human-crafted reasoning chains as references, but they
typically require fine-tuning on datasets with human-derived reasoning chains,
which complicates the process and raises concerns regarding generalizability
across diverse datasets. To address these challenges, we harness GPT-4 to
automatically evaluate reasoning chain quality, obviating the need for
human-crafted references. Leveraging the Socratic method, we devise tailored
prompts to enhance reference-free reasoning evaluation, which we term SocREval
(Socratic method for Reasoning Evaluation). Empirical results from four
human-annotated datasets reveal that SocREval significantly improves GPT-4's
performance, surpassing existing reference-free and reference-based reasoning
evaluation metrics. Beyond its demonstrated efficacy, our proposed framework,
large language models (LLMs) with the Socratic method, proves to be both
cost-efficient and robust to prompt writing and example selection, as
substantiated by our in-depth analysis.
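
To make the idea of a reference-free, prompt-based evaluator concrete, here is a minimal sketch. The prompt wording, the 1 to 5 score scale, and the helper names `build_eval_prompt` and `parse_score` are all assumptions made for illustration; they are not the prompts used by SocREval. The sketch only builds the prompt text and parses a reply, and it makes no call to any model API.

```python
import re
from typing import Optional


def build_eval_prompt(question: str, reasoning_chain: str) -> str:
    """Assemble a reference-free evaluation prompt (hypothetical wording)."""
    return (
        "You will judge a step-by-step answer to a question.\n"
        "1. First write your own answer to the question.\n"
        "2. Then compare the given reasoning with yours and discuss any errors.\n"
        "3. Finally output a line 'Overall score: N' with N from 1 to 5.\n\n"
        f"Question: {question}\n"
        f"Reasoning to evaluate:\n{reasoning_chain}\n"
    )


def parse_score(reply: str) -> Optional[int]:
    """Extract the integer score from the evaluator's reply, if present."""
    match = re.search(r"Overall score:\s*([1-5])\b", reply)
    return int(match.group(1)) if match else None
```

The prompt asks the evaluator to answer the question itself before judging, which is one way to replace a human-written reference chain with a self-generated one.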

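Using GPT-4 and the Socratic method, we propose SocREval, a new evaluation framework that automatically assesses the reasoning of current models. Without human-written reference chains, it significantly improves GPT-4's performance and surpasses existing reference-based and reference-free reasoning evaluation metrics. Our analysis also shows that the framework is cost-efficient and robust to prompt writing and example selection.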

SocREval: Large Language Models with the Socratic Method for Reference-Free Reasoning Evaluation

Visual Question Answering (VQA) methods have made incredible progress, but
suffer from a failure to generalize. This is visible in the fact that they are
vulnerable to learning coincidental correlations in the data rather than deeper
relations between image content and ideas expressed in language. We present a
dataset that takes a step towards addressing this problem in that it contains
questions expressed in two languages, and an evaluation process that co-opts a
well-understood image-based metric to reflect the method's ability to reason.
Measuring reasoning directly encourages generalization by penalizing answers
that are coincidentally correct. The dataset reflects the scene-text version of
the VQA problem, and the reasoning evaluation can be seen as a text-based
version of a referring expression challenge. Experiments and analysis are
provided that show the value of the dataset.
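
The abstract does not name the image-based metric. As a sketch only, the code below assumes it is intersection over union (IoU) between a predicted evidence box and an annotated one, and assumes a threshold of 0.5; both choices are illustrative. The function `grounded_correct` is a hypothetical helper showing how an answer that is right for the wrong reason could be penalized.

```python
from typing import Tuple

# Box given as (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def area(box: Box) -> float:
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def grounded_correct(pred_answer: str, gold_answer: str,
                     pred_box: Box, gold_box: Box,
                     threshold: float = 0.5) -> bool:
    """Count an answer only if its text matches AND its evidence box overlaps.

    The threshold of 0.5 is an assumption, not a value from the source.
    """
    same_text = pred_answer.strip().lower() == gold_answer.strip().lower()
    return same_text and iou(pred_box, gold_box) >= threshold
```

With this kind of check, a model that outputs the right string while pointing at unrelated scene text would score zero for that question.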

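This work presents a multilingual dataset that targets the generalization problem of visual question answering methods, uses a reasoning-based metric to encourage generalization, and provides experimental evidence of the dataset's value.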