Recent advancements in Large Language Models (LLMs) and Multi-Modal Models
(MMs) have demonstrated their remarkable capabilities in problem-solving. Yet,
their proficiency in tackling geometry math problems, which require an
integrated understanding of both textual and visual information, has not been
thoroughly evaluated. To address this gap, we introduce the GeoEval benchmark,
a comprehensive collection that includes a main subset of 2000 problems, a
750-problem subset focusing on backward reasoning, an augmented subset of 2000
problems, and a hard subset of 300 problems. This benchmark facilitates a
deeper investigation into the performance of LLMs and MMs on solving geometry
math problems. Our evaluation of ten LLMs and MMs across these varied subsets
reveals that the WizardMath model excels, achieving 55.67\% accuracy on
the main subset but only 6.00\% accuracy on the hard subset. This
highlights the critical need for testing models against datasets on which they
have not been pre-trained. Additionally, our findings indicate that GPT-series
models perform better on problems that they themselves have rephrased, suggesting a
promising method for enhancing model capabilities.
