Recent large language models (LLMs) have demonstrated versatile capabilities in long-context scenarios. Although several recent benchmarks evaluate the long-context capabilities of LLMs, none of them evaluates mathematical reasoning over long contexts, an ability that is crucial for applying LLMs in real-world scenarios. In this paper, we introduce MathHay, an automated benchmark designed to assess the long-context mathematical reasoning capabilities of LLMs. Unlike previous benchmarks such as Needle in a Haystack, which focus primarily on information retrieval within long texts, MathHay requires both information-seeking and complex mathematical reasoning abilities. We conduct extensive experiments on MathHay to assess the long-context mathematical reasoning abilities of eight top-performing LLMs. Even the best-performing model, Gemini-1.5-Pro-002, still struggles with mathematical reasoning over long contexts, achieving only 51.26% accuracy at 128K tokens. This result highlights significant room for improvement on the MathHay benchmark.

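To address the lack of evaluation of existing models' mathematical reasoning over long texts, this paper proposes MathHay, an automated benchmark. The benchmark not only evaluates information retrieval ability but also requires models to perform complex mathematical reasoning. Experimental results show that even the best-performing model, Gemini-1.5-Pro-002, achieves only 51.26% accuracy on long-context mathematical reasoning, indicating that substantial room for improvement remains in this area.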

MathHay: An Automated Benchmark for Long-Context Mathematical Reasoning