While large language models have achieved remarkable performance on various
code generation benchmarks, there have been growing concerns regarding
potential contamination of these benchmarks as they may be leaked into
pretraining and finetuning data. While recent work has investigated
contamination in natural language generation and understanding tasks, there has
been less extensive research into how data contamination impacts the evaluation
of code generation, which is critical for understanding the robustness and
reliability of LLMs in programming contexts. In this work, we perform a
comprehensive study of data contamination of popular code generation
benchmarks, and precisely quantify their overlap with pretraining corpora
through both surface-level and semantic-level matching. In our experiments, we
show that there is substantial overlap between popular code generation
benchmarks and open training corpora, and that models perform significantly
better on the subset of the benchmarks where similar solutions were seen during
training. We also conduct an extensive analysis of the factors that affect model
memorization and generalization, such as model size, problem difficulty, and
question length. We release all resulting files from our matching pipeline for
future research.
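
To make the idea of surface-level matching concrete, the following is a minimal sketch, not the matching pipeline described above. It assumes that a character-level similarity ratio from Python's standard `difflib` module is an acceptable stand-in for a surface-level score; the function names `normalize`, `surface_similarity`, and `best_match` are hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from difflib import SequenceMatcher


def normalize(code: str) -> str:
    """Collapse every run of whitespace so formatting differences are ignored."""
    return " ".join(code.split())


def surface_similarity(a: str, b: str) -> float:
    """Return a character-level similarity ratio in [0, 1] (assumed stand-in metric)."""
    return SequenceMatcher(None, normalize(a), normalize(b), autojunk=False).ratio()


def best_match(solution: str, corpus: list[str]) -> tuple[float, str]:
    """Return the highest score and the corpus document that produced it.

    Raises ValueError if the corpus is empty.
    """
    scored = ((surface_similarity(solution, doc), doc) for doc in corpus)
    return max(scored, key=lambda pair: pair[0])
```

A benchmark solution whose best score against a training corpus is close to 1 would be a candidate for the "seen during training" subset. A score of this kind says nothing about semantically equivalent programs that are written differently, which is why a semantic-level comparison is treated as a separate step.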

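This study comprehensively examines data contamination of large language models in code generation tasks. It analyzes the degree of overlap between common code generation benchmarks and pretraining corpora, shows that model performance improves significantly when similar solutions appear in the training data, and analyzes how factors such as model size, problem difficulty, and question length affect model memorization and generalization.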

Quantifying the Amount of Contamination When Evaluating the Code Generation Ability of Language Models

Quantifying Contamination in Evaluating Code Generation Capabilities of Language Models

Large language models are increasingly trained on all the data ever produced
by humans. Many have raised concerns about the trustworthiness of public
benchmarks due to potential contamination in pre-training or fine-tuning
datasets. While most data decontamination efforts apply string matching (e.g.,
n-gram overlap) to remove benchmark data, we show that these methods are
insufficient, and simple variations of test data (e.g., paraphrasing,
translation) can easily bypass these decontamination measures. Furthermore, we
demonstrate that if such variation of test data is not eliminated, a 13B model
can easily overfit a test benchmark and achieve drastically high performance,
on par with GPT-4. We validate such observations on widely used benchmarks such
as MMLU, GSM8K, and HumanEval. To address this growing risk, we propose a
stronger LLM-based decontamination method and apply it to widely used
pre-training and fine-tuning datasets, revealing significant previously unknown
test overlap. For example, in pre-training sets such as RedPajama-Data-1T and
StarCoder-Data, we identified that 8-18% of the HumanEval benchmark overlaps.
Interestingly, we also find such contamination in synthetic datasets generated
by GPT-3.5/4, suggesting a potential risk of unintentional contamination. We
urge the community to adopt stronger decontamination approaches when using
public benchmarks. Moreover, we call for the community to actively develop
fresh one-time exams to evaluate models accurately. Our decontamination tool is
publicly available.
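
The weakness of string matching noted above can be illustrated with a small sketch. It assumes a simple whitespace tokenizer and a 5-gram window; both are illustrative choices, and the helper names `ngrams` and `ngram_overlap` are hypothetical rather than taken from any released tool.

```python
from __future__ import annotations


def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of lowercase, whitespace-tokenized n-grams in the text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(test_item: str, train_doc: str, n: int = 5) -> float:
    """Fraction of the test item's n-grams that also occur in the training document."""
    test_grams = ngrams(test_item, n)
    if not test_grams:
        # The test item is shorter than n tokens, so there is nothing to match.
        return 0.0
    return len(test_grams & ngrams(train_doc, n)) / len(test_grams)


if __name__ == "__main__":
    test_item = "write a function that returns the sum of all even numbers in a list"
    verbatim = "problem 12: " + test_item + " and test it"
    paraphrase = "implement a routine which adds up every even integer contained in an array"
    print(ngram_overlap(test_item, verbatim))    # verbatim copy: every 5-gram matches
    print(ngram_overlap(test_item, paraphrase))  # paraphrase: no 5-gram matches
```

In this toy case the verbatim copy scores 1.0 while the paraphrase scores 0.0, even though both pose the same task. A filter that thresholds on this score would therefore keep the paraphrased item in the training set.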

Data contamination in large language models and the corresponding detection and decontamination methods