Large language models (LLMs) have achieved impressive success on many
benchmarks for mathematical reasoning. However, there is growing concern that
some of this performance actually reflects dataset contamination, where data
closely resembling benchmark questions leaks into the training data, instead of
true reasoning ability. To investigate this claim rigorously, we commission
Grade School Math 1000 (GSM1k). GSM1k is designed to mirror the style and
complexity of the established GSM8k benchmark, the gold standard for measuring
elementary mathematical reasoning. We ensure that the two benchmarks are
comparable across important metrics such as human solve rates, number of steps
in solution, answer magnitude, and more. When evaluating leading open- and
closed-source LLMs on GSM1k, we observe accuracy drops of up to 13%, with
several families of models (e.g., Phi and Mistral) showing evidence of
systematic overfitting across almost all model sizes. At the same time, many
models, especially those on the frontier (e.g., Gemini/GPT/Claude), show
minimal signs of overfitting. Further analysis suggests a positive relationship
(Spearman's r^2=0.32) between a model's probability of generating an example
from GSM8k and its performance gap between GSM8k and GSM1k, indicating that
many models may have partially memorized GSM8k.
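
To make the rank-correlation analysis above concrete, the following is a minimal sketch of how a Spearman correlation between per-model generation likelihood and the GSM8k-to-GSM1k accuracy gap might be computed. The function names and all numeric values are hypothetical placeholders chosen for illustration. They are not data or code from the paper, and the paper's exact likelihood measure is not specified here.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical placeholder values, NOT results from the paper:
# one entry per model.
log_likelihoods = [-1.9, -1.4, -1.7, -1.1, -1.3]  # assumed per-character log-likelihood on GSM8k
accuracy_gaps = [0.5, 4.0, 1.0, 6.5, 9.0]         # assumed GSM8k minus GSM1k accuracy, in points

rho = spearman(log_likelihoods, accuracy_gaps)
print(f"Spearman rho = {rho:.2f}, rho^2 = {rho ** 2:.2f}")
```

Because the statistic depends only on ranks, it captures any monotonic relationship between likelihood and gap without assuming that the relationship is linear.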

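Large language models have achieved impressive success on many mathematical reasoning benchmarks, but there is growing concern that part of this performance reflects dataset contamination rather than true reasoning ability. The investigation shows that many models may have partially memorized benchmark examples, leading to accuracy drops on a new benchmark.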

A Careful Examination of Large Language Model Performance on Grade School Arithmetic

Over the last few years, natural language interfaces (NLIs) for databases have
gained significant traction both in academia and industry. These systems use
very different approaches as described in recent survey papers. However, these
systems have not been systematically compared against a set of benchmark
questions in order to rigorously evaluate their functionalities and expressive
power.
In this paper, we give an overview of 24 recently developed NLIs for
databases. Each of the systems is evaluated using a curated list of ten sample
questions to show their strengths and weaknesses. We categorize the NLIs into
four groups based on the methodology they are using: keyword-, pattern-,
parsing-, and grammar-based NLIs. Overall, we learned that keyword-based systems
suffice to answer simple questions. To solve more complex questions
involving subqueries, the system needs to apply some sort of parsing to
identify structural dependencies. Grammar-based systems are overall the most
powerful ones, but are highly dependent on their manually designed rules. In
addition to providing a systematic analysis of the major systems, we derive
lessons learned that are vital for designing NLIs that can answer a wide range
of user questions.

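This paper evaluates 24 recently developed natural language interfaces (NLIs) for databases and categorizes them into four groups: keyword-, pattern-, parsing-, and grammar-based. It finds that grammar-based systems are the most powerful but depend heavily on their manually designed rules, and the lessons derived are vital for designing NLIs that can answer a wide range of user questions.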