Large language models (LLMs) demonstrate impressive capabilities in
mathematical reasoning. However, despite these achievements, current
evaluations are mostly limited to specific mathematical topics, and it remains
unclear whether LLMs are genuinely engaging in reasoning. To address these
gaps, we present the Mathematical Topics Tree (MaTT) benchmark, a challenging
and structured benchmark that offers 1,958 questions across a wide array of
mathematical subjects, each paired with a detailed hierarchical chain of
topics. Upon assessing different LLMs using the MaTT benchmark, we find that
the most advanced model, GPT-4, achieved a mere 54\% accuracy in a
multiple-choice scenario. Interestingly, even when employing Chain-of-Thought
prompting, we observe almost no notable improvement. Moreover, LLMs' accuracy
dropped by up to 24.2 percentage points when the questions were
presented without providing choices. Further detailed analysis of the LLMs'
performance across a range of topics showed significant discrepancies even for
closely related subtopics within the same general mathematical area. In an
effort to pinpoint the reasons behind LLMs' performance, we conducted a manual
evaluation of the completeness and correctness of the explanations generated by
GPT-4 when choices were available. Surprisingly, we find that in only 53.3\% of
the instances where the model provided a correct answer, the accompanying
explanations were deemed complete and accurate, i.e., the model engaged in
genuine reasoning.
