Data contamination in language model evaluation is increasingly prevalent as
large language models grow in popularity. It allows models to "cheat" via
memorisation instead of displaying true capabilities. Therefore, contamination
analysis has become a crucial part of reliable model evaluation to validate
results. However, existing contamination analysis is usually conducted
internally by LLM developers and often lacks transparency and completeness.
This paper presents an open-source data contamination report for the Llama
series models. We analyse six popular multi-choice QA benchmarks and quantify
their overlap with the training set of Llama. Various levels of
contamination, ranging from 1\% to 8.7\%, are found across benchmarks. Our
comparison also reveals that Llama models can gain over 5\% higher accuracy on
contaminated subsets versus clean subsets. Data and code are available at:
this https URL
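
As a rough illustration of what "quantifying overlap" and "comparing contaminated versus clean subsets" can look like, the following is a minimal sketch. It is not the authors' pipeline: the word-level n-gram matching, the default values of `n` and `threshold`, and all function names are assumptions made for this example.

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int) -> Set[NGram]:
    """Return the set of lowercased, whitespace-tokenised word n-grams."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_index(corpus: Iterable[str], n: int) -> Set[NGram]:
    """Collect every n-gram that occurs in the (hypothetical) training corpus."""
    index: Set[NGram] = set()
    for document in corpus:
        index |= ngrams(document, n)
    return index


def overlap_ratio(sample: str, index: Set[NGram], n: int) -> float:
    """Fraction of the sample's n-grams that also occur in the corpus index."""
    grams = ngrams(sample, n)
    if not grams:
        return 0.0  # sample shorter than n tokens: nothing to match
    return len(grams & index) / len(grams)


def flag_contaminated(samples: List[str], index: Set[NGram],
                      n: int = 8, threshold: float = 0.5) -> List[bool]:
    """Mark a sample as contaminated when its overlap ratio reaches the threshold.

    Both n and threshold are illustrative defaults, not values from the paper.
    """
    return [overlap_ratio(s, index, n) >= threshold for s in samples]


def contamination_rate(flags: List[bool]) -> float:
    """Share of benchmark samples flagged as contaminated."""
    return sum(flags) / len(flags) if flags else 0.0


def accuracy_gap(flags: List[bool], correct: List[bool]) -> float:
    """Accuracy on the contaminated subset minus accuracy on the clean subset."""
    dirty = [c for f, c in zip(flags, correct) if f]
    clean = [c for f, c in zip(flags, correct) if not f]
    if not dirty or not clean:
        raise ValueError("both subsets must be non-empty")
    return sum(dirty) / len(dirty) - sum(clean) / len(clean)
```

A positive `accuracy_gap` would correspond to the kind of advantage on contaminated subsets that the abstract reports. A real analysis over a web-scale corpus would need a far more scalable index than an in-memory set.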

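This report presents an open-source data contamination report for the Llama series models. It analyses six popular multiple-choice QA benchmarks and quantifies their overlap with Llama's training set. Contamination levels ranging from 1% to 8.7% are found across the benchmarks. The comparison also shows that Llama models can achieve over 5% higher accuracy on contaminated subsets than on clean subsets. Data and code are available at the link.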

An Open Source Data Contamination Report for Llama Series Models

Data contamination in model evaluation is becoming increasingly prevalent, as
the massive training corpora of large language models often unintentionally
include benchmark samples. Therefore, contamination analysis has become an
inevitable part of reliable model evaluation. However, existing methods of
contamination analysis require access to the entire training data, which is
often confidential for recent models. This prevents the community from
rigorously auditing these models and conducting accurate assessments of their
capability. In this paper, we propose a novel method to quantify contamination
without access to the full training set, which measures the extent of
contamination with perplexity. Our analysis provides evidence of significant
memorisation by recent foundation models on popular reading comprehension and
summarisation benchmarks, while multiple-choice benchmarks appear less
contaminated.
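
For readers unfamiliar with the quantity involved, perplexity over a sequence of N tokens is exp(-(1/N) * sum of the per-token log-probabilities), so text a model has memorised tends to score lower. The sketch below only illustrates that computation; it is not the paper's method. It assumes the per-token natural-log probabilities have already been obtained from some model, and the comparison against a clean baseline set (`perplexity_ratio`) is a hypothetical way to read the numbers.

```python
import math
from typing import List, Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity of one sample from its per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def mean_perplexity(samples: List[Sequence[float]]) -> float:
    """Average per-sample perplexity over a set of samples."""
    if not samples:
        raise ValueError("need at least one sample")
    return sum(perplexity(s) for s in samples) / len(samples)


def perplexity_ratio(benchmark: List[Sequence[float]],
                     baseline: List[Sequence[float]]) -> float:
    """Benchmark perplexity relative to a clean baseline (hypothetical measure).

    Values well below 1.0 mean the model finds the benchmark text much more
    predictable than comparable unseen text, which may hint at memorisation.
    """
    return mean_perplexity(benchmark) / mean_perplexity(baseline)
```

What counts as a suitable baseline, and how far below 1.0 the ratio must fall before it is treated as evidence of contamination, are choices this sketch leaves open.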

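Recent research shows that data contamination is widespread in the training corpora of large language models, while existing contamination analysis methods require access to the complete training data, which often limits rigorous auditing and accurate assessment of these models. This paper proposes a new method to quantify data contamination, measuring the extent of contamination through perplexity. The analysis shows significant memorisation by recent foundation models on popular reading comprehension and summarisation data, while multiple-choice data is less contaminated.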