Our work presents a novel angle for evaluating the mathematical abilities of language models (LMs) by investigating whether they can discern the skills and concepts enabled by math content. We contribute two datasets: one consisting of 385 fine-grained descriptions of K-12 math skills and concepts, or standards, from Achieve the Core (ATC), and another of 9.9K problems labeled with these standards (MathFish). Working with experienced teachers, we find that LMs struggle to tag and verify standards linked to problems, and instead predict labels that are close to ground truth but differ in subtle ways. We also show that LMs often generate problems that do not fully align with the standards described in prompts. Finally, we categorize problems in GSM8k using math standards, allowing us to better understand why some problems are more difficult for models to solve than others.

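This study addresses a gap in how the mathematical abilities of language models are evaluated, proposing the use of educational standards to analyze how well language models understand math skills. We develop two datasets and find that language models struggle to tag and verify the standards associated with problems, and that the problems they generate often do not fully align with those standards. This research offers a new perspective on why some math problems are easier or harder for language models to solve.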

Evaluating the Mathematical Reasoning Ability of Language Models Through Educational Curricula