This paper introduces ConceptMath, a bilingual (English and Chinese), fine-grained benchmark that evaluates concept-wise mathematical reasoning of Large Language Models (LLMs). Unlike traditional benchmarks that evaluate general mathematical reasoning with an average accuracy, ConceptMath systematically organizes math problems under a hierarchy of math concepts, so that mathematical reasoning can be evaluated at different granularity with concept-wise accuracies. Based on our ConcepthMath, we evaluate a broad range of LLMs, and we observe existing LLMs, though achieving high average accuracies on traditional benchmarks, exhibit significant performance variations across different math concepts and may even fail catastrophically on the most basic ones. Besides, we also introduce an efficient fine-tuning strategy to enhance the weaknesses of existing LLMs. Finally, we hope ConceptMath could guide the developers to understand the fine-grained mathematical abilities of their models and facilitate the growth of foundation models.

本研究介绍了ConceptMath，它是一个双语（英文和中文）的细粒度基准，用于评估大型语言模型的概念级数学推理能力。与评估一般数学推理平均准确率的传统基准不同，ConceptMath通过将数学问题按照数学概念的层次进行系统组织，从而可以用概念级准确率评估数学推理能力的不同细粒度。在基于我们的ConceptMath的基础上，我们评估了广泛范围的大型语言模型，并观察到现有的大型语言模型尽管在传统基准上具有高平均准确率，但在不同数学概念上存在显著的性能差异，甚至在最基本的概念上可能出现灾难性失误。此外，我们还介绍了一种高效的微调策略，以提高现有大型语言模型的弱点。最后，我们希望ConceptMath能够指导开发人员了解其模型的细粒度数学能力，并促进基础模型的进一步发展。

ConceptMath：大型语言模型数学推理的双语概念评估基准