We introduce SuperCLUE-Math6 (SC-Math6), a new benchmark dataset to evaluate
the mathematical reasoning abilities of Chinese language models. SC-Math6 is
designed as an upgraded Chinese version of the GSM8K dataset with enhanced
difficulty, diversity, and application scope. It consists of over 2,000
mathematical word problems that require multi-step reasoning, each paired with a
natural language solution. We propose an innovative scheme to quantify the reasoning
capability of large models based on their performance on problems with different
numbers of reasoning steps. Experiments on 12 representative Chinese models demonstrate a
clear stratification of reasoning levels, with top models such as GPT-4 showing
superior performance. SC-Math6 fills the gap in Chinese mathematical reasoning
benchmarks and provides a comprehensive testbed to advance the intelligence of
Chinese language models.
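
To make the idea of scoring by reasoning steps concrete, the following is a minimal sketch that groups per-problem results by the number of reasoning steps and reports accuracy for each group. The abstract does not specify the actual scoring scheme, so this only illustrates the stratification step; the function name `accuracy_by_steps`, the record layout, and the toy results are all hypothetical.

```python
from collections import defaultdict


def accuracy_by_steps(records):
    """Return accuracy per reasoning-step count.

    `records` is an iterable of (num_steps, is_correct) pairs, one per
    problem. This layout is an assumption made for illustration only.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for num_steps, is_correct in records:
        totals[num_steps] += 1
        correct[num_steps] += int(is_correct)
    # Sort by step count so the result reads from easy to hard.
    return {k: correct[k] / totals[k] for k in sorted(totals)}


# Toy, made-up results: (number of reasoning steps, answered correctly?)
toy_results = [
    (1, True), (1, True),
    (2, True), (2, False),
    (3, False), (3, True), (3, False), (3, False),
]
per_step = accuracy_by_steps(toy_results)
print(per_step)
```

How the per-step accuracies would be combined into a single reasoning level is left open here, since that depends on details not given in the abstract.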
