The opacity in developing large language models (LLMs) is raising growing
concerns about the potential contamination of public benchmarks in the
pre-training data. Existing contamination detection methods are typically based
on the text overlap between training and evaluation data, which can be too
superficial to reflect deeper forms of contamination. In this paper, we first
present a cross-lingual form of contamination that inflates LLMs' performance
while evading current detection methods, deliberately injected by overfitting
LLMs on the translated versions of benchmark test sets. Then, we propose
generalization-based approaches to unmask such deeply concealed contamination.
Specifically, we examine the LLM's performance change after modifying the
original benchmark by replacing the false answer choices with correct ones from
other questions. Contaminated models can hardly generalize to such easier
situations, where the false choices can be \emph{not even wrong}, as all
choices are correct in their memorization. Experimental results demonstrate
that cross-lingual contamination can easily fool existing detection methods,
but not ours. In addition, we discuss the potential utilization of
cross-lingual contamination in interpreting LLMs' working mechanisms and in
post-training LLMs for enhanced multilingual capabilities. The code and dataset
we use can be obtained from https://github.com/ShangDataLab/Deep-Contam.

开发大型语言模型的不透明性引起了关于潜在的训练数据污染的担忧。我们提出了一种基于跨语言的深层污染形式，可以欺骗传统的检测方法。我们还探讨了跨语言污染在解释语言模型的工作机制和提升多语言能力方面的潜在用途。

数据污染能够跨越语言障碍

Data Contamination Can Cross Language Barriers

Data contamination, i.e., the presence of test data from downstream tasks in
the training data of large language models (LLMs), is a potential major issue
in understanding LLMs' effectiveness on other tasks. We propose a
straightforward yet effective method for identifying data contamination within
LLMs. At its core, our approach starts by identifying potential contamination
in individual instances that are drawn from a small random sample; using this
information, our approach then assesses if an entire dataset partition is
contaminated. To estimate contamination of individual instances, we employ
"guided instruction:" a prompt consisting of the dataset name, partition type,
and the initial segment of a reference instance, asking the LLM to complete it.
An instance is flagged as contaminated if the LLM's output either exactly or
closely matches the latter segment of the reference. To understand if an entire
partition is contaminated, we propose two ideas. The first idea marks a dataset
partition as contaminated if the average overlap score with the reference
instances (as measured by ROUGE or BLEURT) is statistically significantly
better with the guided instruction vs. a general instruction that does not
include the dataset and partition name. The second idea marks a dataset as
contaminated if a classifier based on GPT-4 with in-context learning prompting
marks multiple instances as contaminated. Our best method achieves an accuracy
between 92% and 100% in detecting if an LLM is contaminated with seven
datasets, containing train and test/validation partitions, when contrasted with
manual evaluation by human expert. Further, our findings indicate that GPT-4 is
contaminated with AG News, WNLI, and XSum datasets.

在理解大型语言模型（LLM）对其他任务的有效性中，数据污染（即，在训练数据中存在来自下游任务的测试数据）可能是一个重要问题。我们提出了一种简单但有效的方法来识别 LLMs 中的数据污染，该方法通过识别来自小型随机样本的个别实例中的潜在污染，然后评估整个数据集分区是否受到了污染。