The advancement of large language models (LLMs) relies on evaluation using
public benchmarks, but data contamination can lead to overestimated
performance. Previous research focuses on detecting contamination by
determining whether the model has seen the exact same data during training. In
this work, we argue that even training on data similar to benchmark data
inflates performance on in-distribution tasks without improving overall
capacity, a phenomenon we call in-distribution contamination. To effectively
detect in-distribution contamination, we propose DICE, a novel method that
leverages the internal states of LLMs to locate and then detect the
contamination. DICE first
identifies the most sensitive layer to contamination, then trains a classifier
based on the internal states of that layer. Experiments reveal DICE's high
accuracy in detecting in-distribution contamination across various LLMs and
math reasoning datasets. We also show the generalization capability of the
trained DICE detector, which is able to detect contamination across multiple
benchmarks with similar distributions. Additionally, we find that the DICE
detection scores are positively correlated with the performance of ten LLMs
fine-tuned by either us or other organizations on four math reasoning datasets
(with $R^2$ values between 0.6 and 0.75). This indicates that the
in-distribution contamination problem potentially leads to an overestimation of
the true capabilities of many existing models. The code and data are available
at this https URL
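
The locate-then-detect idea described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the sensitivity measure (mean Euclidean distance between the layer states of a reference model and a contaminated model), the logistic-regression classifier, the synthetic hidden states, and all function names are hypothetical.

```python
import math
import random


def layer_distance(states_a, states_b):
    """Mean Euclidean distance between paired hidden-state vectors."""
    total = 0.0
    for a, b in zip(states_a, states_b):
        total += math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return total / len(states_a)


def locate_sensitive_layer(reference_states, contaminated_states):
    """Index of the layer whose states differ most between the two models."""
    distances = [
        layer_distance(ref, con)
        for ref, con in zip(reference_states, contaminated_states)
    ]
    return max(range(len(distances)), key=lambda i: distances[i])


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def contamination_score(weights, bias, state):
    """Probability-like score that a hidden state comes from contaminated data."""
    return sigmoid(sum(w * x for w, x in zip(weights, state)) + bias)


def train_logistic(features, labels, epochs=200, lr=0.05, seed=0):
    """Plain SGD logistic regression (assumed stand-in for the detector)."""
    rng = random.Random(seed)
    weights = [0.0] * len(features[0])
    bias = 0.0
    order = list(range(len(features)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            grad = contamination_score(weights, bias, features[i]) - labels[i]
            weights = [w - lr * grad * x for w, x in zip(weights, features[i])]
            bias -= lr * grad
    return weights, bias


# Synthetic hidden states: 3 layers, 20 examples, 4 dimensions (all assumed).
rng = random.Random(0)
shifts = [0.1, 3.0, 0.5]  # per-layer offset; the middle layer is most affected
num_examples, dim = 20, 4
reference_states = [
    [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_examples)]
    for _ in shifts
]
contaminated_states = [
    [[v + s for v in vec] for vec in layer]
    for layer, s in zip(reference_states, shifts)
]

# Step 1: locate the most sensitive layer.
best_layer = locate_sensitive_layer(reference_states, contaminated_states)

# Step 2: train a detector on that layer's states (0 = clean, 1 = contaminated).
features = reference_states[best_layer] + contaminated_states[best_layer]
labels = [0] * num_examples + [1] * num_examples
weights, bias = train_logistic(features, labels)
accuracy = sum(
    int(contamination_score(weights, bias, x) > 0.5) == y
    for x, y in zip(features, labels)
) / len(labels)
```

In practice the hidden states would come from real model forward passes, and the detector would be evaluated on held-out data rather than on its own training set.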

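This study proposes DICE, a new method that uses the internal states of large language models to detect in-distribution contamination. The method achieves high accuracy across various LLMs and math reasoning datasets, and the results indicate that in-distribution contamination may lead to an overestimation of the true capabilities of existing models.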

DICE: Detecting In-distribution Contamination in LLM's Fine-tuning Phase for Math Reasoning

The success of large language models (LLMs), like GPT-3 and ChatGPT, has led
to the development of numerous cost-effective and accessible alternatives that
are created by fine-tuning open-access LLMs with task-specific data (e.g.,
ChatDoctor) or instruction data (e.g., Alpaca). Among the various fine-tuning
methods, adapter-based parameter-efficient fine-tuning (PEFT) is undoubtedly
one of the most attractive topics, as it only requires fine-tuning a few
external parameters instead of the entire LLM while achieving comparable or
even better performance. To enable further research on PEFT methods for LLMs,
this paper presents LLM-Adapters, an easy-to-use framework that integrates
various adapters into LLMs and can execute these adapter-based PEFT methods
for different tasks. The framework includes state-of-the-art open-access
LLMs such as LLaMA, BLOOM, OPT, and GPT-J, as well as widely used adapters such
as Series adapter, Parallel adapter, and LoRA. The framework is designed to be
research-friendly, efficient, modular, and extendable, allowing the integration
of new adapters and their evaluation with new and larger-scale LLMs.
Furthermore, to evaluate the effectiveness of adapters in LLM-Adapters, we
conduct experiments on six math reasoning datasets. The results demonstrate
that using adapter-based PEFT in smaller-scale LLMs (7B) with few extra
trainable parameters yields comparable, and in some cases superior, performance
to that of powerful LLMs (175B) in zero-shot inference on simple math reasoning
datasets. Overall, we provide a promising framework for fine-tuning LLMs
on downstream tasks. We believe the proposed LLM-Adapters will advance
adapter-based PEFT research, facilitate the deployment of research pipelines,
and enable practical applications to real-world systems.
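
To make the "few extra trainable parameters" point concrete, here is a minimal sketch of a LoRA-style layer, one of the adapter types named above. It is not the LLM-Adapters API: the class name `LoRALinear`, its constructor arguments, and the initialization scale are hypothetical, and it implements only the generic low-rank update y = W x + (alpha / r) * B (A x) with W frozen.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update (illustrative only)."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight          # frozen pretrained weight, never updated
        self.scale = alpha / rank     # scaling applied to the low-rank update
        # A starts small and random, B starts at zero, so the adapted layer
        # initially reproduces the frozen layer exactly.
        self.A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_parameters(self):
        """Only A and B are trained: rank * (d_in + d_out) values."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


# An 8x8 frozen weight has 64 entries; a rank-2 adapter trains only 32.
frozen = [[1.0 if i == j else 0.0 for j in range(8)] for i in range(8)]
layer = LoRALinear(frozen, rank=2)
initial_output = layer.forward([1.0] * 8)
adapter_size = layer.trainable_parameters()
```

The saving grows with layer width: for a square layer of width d, the adapter trains 2 * r * d values instead of d * d.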

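This paper proposes the LLM-Adapters framework, which fine-tunes smaller LLMs with a small number of trainable parameters to support a variety of tasks. Experiments on six math reasoning datasets show that applying adapter-based PEFT to smaller LLMs (7B) achieves performance comparable, and in some cases superior, to that of powerful LLMs (175B). The framework aims to advance adapter-based PEFT research and provides a useful tool for fine-tuning LLMs at scale.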