Recent advances in language models have demonstrated their capability to solve mathematical reasoning problems, achieving near-perfect accuracy on grade-school-level math benchmarks like GSM8K. In this paper, we formally study how language models solve these problems. We design a series of controlled experiments to address several fundamental questions: (1) Can language models truly develop reasoning skills, or do they simply memorize templates? (2) What is the model's hidden (mental) reasoning process? (3) Do models solve math questions using skills similar to or different from humans? (4) Do models trained on GSM8K-like datasets develop reasoning skills beyond those necessary for solving GSM8K problems? (5) What mental process causes models to make reasoning mistakes? (6) How large or deep must a model be to effectively solve GSM8K-level math questions? Our study uncovers many hidden mechanisms by which language models solve mathematical questions, providing insights that extend beyond the current understanding of LLMs.

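This study addresses the gap in understanding how capable language models are at mathematical reasoning and what process they follow. Through a series of controlled experiments, it examines whether language models genuinely possess reasoning skills and what hidden mechanisms underlie their reasoning process. Its findings on the reasoning process models exhibit when solving math problems, and on the sources of their errors, offer important insights for better understanding large language models.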

Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process