Natural Language Processing (NLP) research is increasingly focusing on the
use of Large Language Models (LLMs), with some of the most popular ones being
either fully or partially closed-source. The lack of access to model details,
especially regarding training data, has repeatedly raised concerns about data
contamination among researchers. Several attempts have been made to address
this issue, but they are limited to anecdotal evidence and trial and error.
Additionally, they overlook the problem of \emph{indirect} data leaking, where
models are iteratively improved by using data coming from users. In this work,
we conduct the first systematic analysis of work using OpenAI's GPT-3.5 and
GPT-4, the most prominently used LLMs today, in the context of data
contamination. By analysing 255 papers and considering OpenAI's data usage
policy, we extensively document the amount of data leaked to these models
during the first year after the model's release. We report that these models
have been globally exposed to $\sim$4.7M samples from 263 benchmarks. At the
same time, we document a number of evaluation malpractices emerging in the
reviewed papers, such as unfair or missing baseline comparisons and
reproducibility issues. We release our results as a collaborative project on
this https URL, where other researchers can contribute to our
efforts.

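Summary: This work presents the first systematic analysis of research using OpenAI's GPT-3.5 and GPT-4 in the context of data contamination. It finds that, in the first year after release, roughly 4.7M samples from 263 benchmarks were leaked to the models, and it documents unfair or missing baseline comparisons and reproducibility issues in the reviewed papers.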