In this work, we introduce a novel evaluation paradigm for Large Language
Models (LLMs), one that challenges them to engage in meta-reasoning. This approach
addresses critical shortcomings in existing math problem-solving benchmarks,
traditionally used to evaluate the cognitive capabilities of agents. Our
paradigm shifts the focus from result-oriented assessments, which often
overlook the reasoning process, to a more holistic evaluation that effectively
differentiates the cognitive capabilities among models. For example, in our
benchmark, GPT-4 is ten times more accurate than
GPT-3.5. The significance of this new paradigm lies in its ability to reveal
potential cognitive deficiencies in LLMs that current benchmarks, such as
GSM8K, fail to uncover because they are saturated and no longer
differentiate among varying reasoning abilities. Our comprehensive analysis
includes several state-of-the-art math models from both open-source and
closed-source communities, uncovering fundamental deficiencies in their
training and evaluation approaches. This paper not only advocates for a
paradigm shift in the assessment of LLMs but also contributes to the ongoing
discourse on the trajectory towards Artificial General Intelligence (AGI). By
promoting the adoption of meta-reasoning evaluation methods similar to ours, we
aim to facilitate a more accurate assessment of the true cognitive abilities of
LLMs.
