Large language models (LLMs) have shown increasing capability in
problem-solving and decision-making, largely through step-by-step
chain-of-thought reasoning. However, evaluating the reasoning capability of
LLMs has become increasingly challenging. Concretely, existing
outcome-based benchmarks are beginning to saturate and are no longer sufficient
to track progress. To this end, we present MR-BEN, a process-based benchmark
that demands a meta-reasoning skill: LLMs are asked to locate and analyse
potential errors in automatically generated reasoning steps. MR-BEN is a
comprehensive benchmark comprising 5,975 questions collected from human
experts, covering various subjects such as physics, chemistry, logic, coding,
and more. Using metrics we designed to assess meta-reasoning on this
benchmark, we identify notable limitations and weaknesses of current LLMs,
both open-source and closed-source. For example, open-source models are
seemingly comparable to GPT-4 on outcome-based benchmarks, but they lag far
behind on our benchmark, revealing the underlying reasoning capability gap
between them. Our dataset and code are available at
this https URL

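Large language models show increasingly strong capability in problem-solving and decision-making, but evaluating their reasoning capability has become increasingly challenging. To address this, we propose MR-BEN, a process-based benchmark that requires language models to locate and analyse potential errors in automatically generated reasoning steps. With this benchmark, we identify some interesting limitations and weaknesses of current language models.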