To improve language models' proficiency in mathematical reasoning via continual pretraining, we introduce a novel strategy that leverages base language models for autonomous data selection. Departing from conventional supervised fine-tuning or classifiers trained on human-annotated data, our approach uses meta-prompted language models as zero-shot verifiers to autonomously evaluate and select high-quality mathematical content, and we release the curated open-source AutoMathText dataset, which comprises over 200GB of data. To demonstrate the efficacy of our method, we continually pretrained a 7B-parameter Mistral language model on the AutoMathText dataset, achieving substantial improvements in downstream performance on the MATH dataset while using orders of magnitude fewer tokens than previous continual pretraining works. Our method achieves a twofold increase in pretraining token efficiency compared to baselines, underscoring its potential for enhancing models' mathematical reasoning capabilities. The AutoMathText dataset is available at https://huggingface.co/datasets/math-ai/AutoMathText. The code is available at https://github.com/yifanzhang-pro/AutoMathText.
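
The abstract does not spell out how a zero-shot verifier's judgment becomes a selection decision. As a minimal sketch of one plausible reading, assume the base model is prompted with a yes/no question about a document's mathematical quality, and that the score is the softmax probability of the "YES" answer token over the pair {"YES", "NO"}. The function names `lm_score` and `select_documents`, the threshold of 0.6, and the logit values below are all hypothetical and chosen only for illustration; no model is called here.

```python
import math


def lm_score(logit_yes: float, logit_no: float) -> float:
    """Softmax probability of "YES" over the two answer tokens (assumed scoring rule)."""
    # Subtract the max logit before exponentiating for numerical stability.
    m = max(logit_yes, logit_no)
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)


def select_documents(scored_docs, threshold=0.6):
    """Keep documents whose score meets a (hypothetical) threshold."""
    return [
        doc
        for doc, (logit_yes, logit_no) in scored_docs
        if lm_score(logit_yes, logit_no) >= threshold
    ]


# Hypothetical (document, (YES logit, NO logit)) pairs standing in for
# the logits a verifier model would produce.
candidates = [
    ("Proof that sqrt(2) is irrational ...", (3.1, 0.4)),
    ("Buy cheap watches online ...", (-1.2, 2.5)),
]
selected = select_documents(candidates)
```

Because the score is continuous rather than a binary label, the threshold can be raised or lowered to trade dataset size against quality without re-scoring the corpus.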

AutoMathText: Autonomous Data Selection with Language Models for Mathematical Texts