Existing math datasets evaluate the reasoning abilities of large language models (LLMs) using either the final answer or the intermediate reasoning steps derived from static examples. However, the former approach fails to surface a model's use of shortcuts and wrong reasoning, while the latter poses challenges in accommodating alternative solutions. In this work, we use symbolic programs as a means of automatically evaluating whether a model can consistently produce correct final answers across various inputs to the program. We begin by extracting programs for popular math datasets (GSM8K and MATH) using GPT4-o. Executable programs that are verified against the original input-output pairs are found to encapsulate the proper reasoning required to solve the original text questions. We then prompt GPT4-o to generate new questions using alternative input-output pairs based on the extracted program. We apply the resulting datasets to evaluate a collection of LLMs. In our experiments, we observe significant accuracy drops under our proposed evaluation compared with the original static examples, suggesting the fragility of math reasoning in state-of-the-art LLMs.
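
The pipeline described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the example question, the function names (`program`, `verify`, `perturb`, `is_consistent`), and the random sampling range are hypothetical and are not taken from the paper, which uses GPT4-o both to extract programs and to write the new questions.

```python
import random

# Hypothetical GSM8K-style question (not from the dataset):
# "A baker sells 12 loaves a day at $3 each. How much is earned in 5 days?"
ORIGINAL_INPUTS = {"loaves_per_day": 12, "price": 3, "days": 5}
ORIGINAL_ANSWER = 180


def program(loaves_per_day, price, days):
    """Symbolic program assumed to capture the question's reasoning."""
    return loaves_per_day * price * days


def verify(prog, inputs, answer):
    """Keep a program only if it reproduces the original answer."""
    return prog(**inputs) == answer


def perturb(prog, inputs, n, seed=0):
    """Sample n alternative input sets and compute their ground-truth answers."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n):
        new_inputs = {name: rng.randint(1, 20) for name in inputs}
        variants.append((new_inputs, prog(**new_inputs)))
    return variants


def is_consistent(model_answer_fn, variants):
    """A model passes only if it is correct on every perturbed variant."""
    return all(model_answer_fn(inp) == ans for inp, ans in variants)
```

In this sketch, `model_answer_fn` stands in for an LLM call that receives a rewritten question and returns a number. A model that relies on a shortcut, such as recalling a memorized answer, would fail `is_consistent` even if it answers the original question correctly.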

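This work addresses the limitations of existing math datasets in evaluating the reasoning abilities of large language models (LLMs) and proposes a new method that uses symbolic programs for automated evaluation. By extracting programs from well-known math datasets, the study shows that these programs effectively encapsulate the proper reasoning needed to solve the original text questions, and evaluation with alternative input-output pairs reveals significant fragility in the math reasoning of state-of-the-art LLMs.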

ReasonAgain: Using Extractable Symbolic Programs to Evaluate Mathematical Reasoning