Recent advances in Large Language Models (LLMs) and Multi-Modal Models (MMs) have demonstrated remarkable problem-solving capabilities. Yet their proficiency on geometry math problems, which require an integrated understanding of both textual and visual information, has not been thoroughly evaluated. To address this gap, we introduce the GeoEval benchmark, a comprehensive collection comprising a main subset of 2000 problems, a 750-problem subset focusing on backward reasoning, an augmented subset of 2000 problems, and a hard subset of 300 problems. This benchmark facilitates a deeper investigation into the performance of LLMs and MMs on geometry math problems. Our evaluation of ten LLMs and MMs across these subsets reveals that the WizardMath model excels, achieving 55.67% accuracy on the main subset but only 6.00% accuracy on the hard subset. This highlights the critical need to test models on datasets on which they have not been pre-trained. Additionally, our findings indicate that GPT-series models perform more effectively on problems they have rephrased, suggesting a promising method for enhancing model capabilities.

GeoEval: A Benchmark for Evaluating LLMs and Multi-Modal Models on Geometry Problem-Solving