Large language models (LLMs) offer impressive performance in various
zero-shot and few-shot tasks. However, their success in zero-shot and few-shot
settings may be affected by task contamination, a potential limitation that has
not been thoroughly examined. This paper investigates how the zero-shot and
few-shot performance of LLMs has changed over time. Utilizing GPT-3 series
models and several other recent open-source LLMs, and controlling
for dataset difficulty, we find that on datasets released before the LLM
training data creation date, LLMs perform surprisingly better than on datasets
released after. This strongly indicates that, for many LLMs, there exists task
contamination on zero-shot and few-shot evaluation for datasets released prior
to the LLMs' training data creation date. Additionally, we utilize training
data inspection, task example extraction, and a membership inference attack,
which reveal further evidence of task contamination. Importantly, we find that
for classification tasks with no possibility of task contamination, LLMs rarely
demonstrate statistically significant improvements over simple majority
baselines, in both zero-shot and few-shot settings.

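Large language models (LLMs) perform well on a variety of zero-shot and few-shot tasks, but their success in these settings may be affected by task contamination. This paper studies how the zero-shot and few-shot performance of LLMs has changed over time. Using GPT-3 series models and several other recent open-source LLMs, and controlling for dataset difficulty, we find that LLMs perform surprisingly better on datasets released before their training data creation date. This strongly indicates that, for many LLMs, zero-shot and few-shot evaluation is contaminated on datasets released before the training data creation date. In addition, training data inspection, task example extraction, and a membership inference attack reveal further evidence of task contamination. Importantly, for classification tasks with no possibility of task contamination, LLMs rarely show statistically significant improvements over simple majority baselines in either the zero-shot or the few-shot setting.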

Task Contamination: Language Models May Not Be Few-Shot Anymore
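
The comparison against a simple majority baseline described in the abstract above can be illustrated with a minimal sketch. It assumes a one-sided exact binomial test against the majority-class rate; the abstract does not say which significance test the authors used, and the function names and the `alpha` threshold are hypothetical.

```python
from collections import Counter
from math import comb


def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent label in `labels`."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def binomial_p_value(num_correct, n, p0):
    """One-sided exact test: P(X >= num_correct) for X ~ Binomial(n, p0)."""
    return sum(
        comb(n, k) * p0**k * (1 - p0) ** (n - k)
        for k in range(num_correct, n + 1)
    )


def beats_majority(labels, predictions, alpha=0.05):
    """Return (model accuracy, baseline accuracy, p-value, significant?).

    Assumption: the baseline rate is taken from the evaluated labels themselves.
    """
    n = len(labels)
    p0 = majority_baseline_accuracy(labels)
    num_correct = sum(y == y_hat for y, y_hat in zip(labels, predictions))
    p_value = binomial_p_value(num_correct, n, p0)
    return num_correct / n, p0, p_value, p_value < alpha
```

A model that merely reproduces the majority label gets a large p-value under this test, so it would not count as a statistically significant improvement.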

Estimating the difficulty of a dataset typically involves comparing
state-of-the-art models to humans; the bigger the performance gap, the harder
the dataset is said to be. However, this comparison provides little
understanding of how difficult each instance in a given distribution is, or
what attributes make the dataset difficult for a given model. To address these
questions, we frame dataset difficulty -- w.r.t. a model $\mathcal{V}$ -- as
the lack of $\mathcal{V}$-$\textit{usable information}$ (Xu et al., 2019),
where a lower value indicates a more difficult dataset for $\mathcal{V}$. We
further introduce $\textit{pointwise $\mathcal{V}$-information}$ (PVI) for
measuring the difficulty of individual instances w.r.t. a given distribution.
While standard evaluation metrics typically only compare different models for
the same dataset, $\mathcal{V}$-$\textit{usable information}$ and PVI also
permit the converse: for a given model $\mathcal{V}$, we can compare different
datasets, as well as different instances/slices of the same dataset.
Furthermore, our framework allows for the interpretability of different input
attributes via transformations of the input, which we use to discover
annotation artefacts in widely-used NLP benchmarks.

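This paper proposes a method for measuring how difficult a dataset is for a given model, uses transformations of the input to interpret which attributes drive that difficulty, and uncovers annotation artifacts in widely used NLP benchmarks.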

Understanding Dataset Difficulty with $\mathcal{V}$-Usable Information
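
Pointwise $\mathcal{V}$-information from the abstract above can be made concrete with a short sketch. It assumes the formulation in which PVI is the gain in log2-probability of the gold label when the model sees the input rather than a null (empty) input, and in which $\mathcal{V}$-usable information is estimated as the mean PVI over held-out instances. The two probabilities are assumed to come from separately fine-tuned models supplied by the caller; the function names are hypothetical.

```python
import math


def pvi(p_with_input, p_null):
    """PVI in bits for a single instance.

    p_with_input: probability the input-conditioned model gives the gold label.
    p_null:       probability the null-input model gives the same gold label.
    """
    return -math.log2(p_null) + math.log2(p_with_input)


def v_information(p_with_input_list, p_null_list):
    """Estimate V-usable information as the mean PVI over held-out instances."""
    values = [pvi(p, q) for p, q in zip(p_with_input_list, p_null_list)]
    return sum(values) / len(values)
```

Under these assumptions, a negative PVI marks an instance on which seeing the input makes the gold label less likely than the label prior alone, which is one way to flag hard or mislabeled instances.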

Recent work establishes dataset difficulty and removes annotation artifacts
via partial-input baselines (e.g., hypothesis-only models for SNLI or
question-only models for VQA). When a partial-input baseline gets high
accuracy, a dataset is cheatable. However, the converse is not necessarily
true: the failure of a partial-input baseline does not mean a dataset is free
of artifacts. To illustrate this, we first design artificial datasets which
contain trivial patterns in the full input that are undetectable by any
partial-input model. Next, we identify such artifacts in the SNLI dataset: a
hypothesis-only model augmented with trivial patterns in the premise can solve
15% of the examples that were previously considered "hard". Our work provides a
caveat for the use of partial-input baselines for dataset verification and
creation.

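Partial-input baselines (such as hypothesis-only models for SNLI or question-only models for VQA) are used to establish dataset difficulty and remove annotation artifacts, but the failure of such a baseline does not mean a dataset is free of artifacts. We design artificial datasets to illustrate this and identify such artifacts in the SNLI dataset; our work offers a caveat for dataset verification and creation.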

Misleading Failures of Partial-input Baselines
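
The claim that a pattern in the full input can be invisible to every partial-input model can be checked on a toy example. The sketch below is a hypothetical construction, not necessarily the authors' artificial datasets: the label is the XOR of one bit in the "premise" and one bit in the "hypothesis", so each part alone carries no information about the label.

```python
from collections import Counter, defaultdict
from itertools import product


def make_xor_dataset(copies=25):
    """Examples (premise_bit, hypothesis_bit, label) with label = XOR of the bits."""
    return [
        (p, h, p ^ h)
        for p, h in product([0, 1], repeat=2)
        for _ in range(copies)
    ]


def best_lookup_accuracy(examples, view):
    """Accuracy of the best classifier that only sees view(premise, hypothesis).

    For each distinct view value it predicts that value's most frequent label.
    """
    by_view = defaultdict(Counter)
    for p, h, y in examples:
        by_view[view(p, h)][y] += 1
    correct = sum(c.most_common(1)[0][1] for c in by_view.values())
    return correct / len(examples)


data = make_xor_dataset()
premise_only = best_lookup_accuracy(data, lambda p, h: p)
hypothesis_only = best_lookup_accuracy(data, lambda p, h: h)
full_input = best_lookup_accuracy(data, lambda p, h: (p, h))
```

On this toy dataset both partial-input views stay at chance (0.5) while the full input is perfectly predictive (1.0), so a failed partial-input baseline does not certify the data as artifact-free.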

In recent years, an increasing number of researchers and practitioners have
proposed algorithms for large-scale neural network architecture search:
genetic algorithms, reinforcement learning, learning curve extrapolation, and
accuracy predictors. None of them, however, has demonstrated high performance
on unseen datasets without running new training experiments. We propose a new
deep neural network accuracy predictor that estimates, in fractions of a
second and without training, the classification performance for unseen input
datasets. In contrast to previously proposed approaches, our prediction is
calibrated not only on topological network information but also on a
characterization of dataset difficulty, which allows us to re-tune the
prediction without any training. Our predictor achieves a throughput that
exceeds 100 networks per second on a single GPU, thus creating the opportunity
to perform large-scale architecture search within a few minutes. We present
results of two searches performed in 400 seconds on a single GPU. Our best
discovered networks reach 93.67% accuracy for CIFAR-10 and 81.01% for
CIFAR-100, verified by training. These networks are competitive in performance
with other automatically discovered state-of-the-art networks, yet we needed
only a small fraction of the time to solution and computational resources.

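This work proposes a new deep neural network accuracy predictor that estimates classification performance on unseen input datasets without any training. It evaluates more than 100 networks per second on a single GPU, so a large-scale architecture search takes only a few minutes.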
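
The control flow of a predictor-guided search, in which many candidate networks are scored without training and only the best few are then verified by training, can be sketched as follows. Everything here is an illustrative assumption: `toy_predictor` is a hypothetical stand-in formula, not the authors' predictor, and the (depth, width) encoding of an architecture and the scalar `dataset_difficulty` are placeholders for the topological and dataset-difficulty inputs the abstract mentions.

```python
import random


def toy_predictor(architecture, dataset_difficulty):
    """Hypothetical stand-in for a trained accuracy predictor (not the paper's).

    Returns a score in [0, 1] that grows with capacity and shrinks with difficulty.
    """
    depth, width = architecture
    capacity = depth * width
    score = 1.0 - dataset_difficulty / (1.0 + capacity / 64)
    return max(0.0, min(1.0, score))


def predictor_guided_search(predict, dataset_difficulty,
                            num_candidates=1000, top_k=5, seed=0):
    """Score randomly sampled candidates with `predict` and keep the top_k.

    No network is trained here; the returned candidates are the ones that
    would subsequently be verified by actual training.
    """
    rng = random.Random(seed)
    candidates = [
        (rng.randint(1, 20), rng.choice([16, 32, 64, 128]))
        for _ in range(num_candidates)
    ]
    ranked = sorted(
        candidates,
        key=lambda arch: predict(arch, dataset_difficulty),
        reverse=True,
    )
    return ranked[:top_k]
```

Because each call to the predictor is cheap, the number of candidates scored is limited mainly by predictor throughput, while training cost is paid only for the few architectures returned.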