Large language models (LLMs) offer impressive performance on a variety of zero-shot and few-shot tasks. However, their success in these settings may be affected by task contamination, a potential limitation that has not been thoroughly examined. This paper investigates how the zero-shot and few-shot performance of LLMs has changed over time. Using GPT-3 series models and several other recent open-source LLMs, and controlling for dataset difficulty, we find that LLMs perform surprisingly better on datasets released before their training data creation date than on datasets released after it. This strongly indicates that, for many LLMs, zero-shot and few-shot evaluation is affected by task contamination on datasets released prior to the training data creation date. Additionally, training data inspection, task example extraction, and a membership inference attack reveal further evidence of task contamination. Importantly, we find that for classification tasks with no possibility of task contamination, LLMs rarely demonstrate statistically significant improvements over simple majority baselines, in both zero-shot and few-shot settings.
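To make the final claim concrete, the following is a minimal sketch of how one might check whether a classifier's accuracy is significantly above a majority-class baseline. The abstract does not say which significance test was used, so the one-sided exact binomial test here is an assumption, and the function names (`majority_baseline_accuracy`, `binomial_p_value`, `beats_majority_baseline`) and the 0.05 threshold are hypothetical choices made for illustration only.

```python
from collections import Counter
from math import comb


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent gold label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def binomial_p_value(n_correct, n_total, p_baseline):
    """One-sided exact binomial test.

    Returns P(X >= n_correct) for X ~ Binomial(n_total, p_baseline), i.e. the
    probability of doing at least this well if the true accuracy were only
    the baseline accuracy.
    """
    return sum(
        comb(n_total, k) * p_baseline**k * (1 - p_baseline) ** (n_total - k)
        for k in range(n_correct, n_total + 1)
    )


def beats_majority_baseline(predictions, labels, alpha=0.05):
    """Compare model accuracy with the majority baseline (assumed test, see lead-in)."""
    n_total = len(labels)
    n_correct = sum(p == y for p, y in zip(predictions, labels))
    p_baseline = majority_baseline_accuracy(labels)
    p_value = binomial_p_value(n_correct, n_total, p_baseline)
    return n_correct / n_total, p_baseline, p_value, p_value < alpha


if __name__ == "__main__":
    gold = ["pos"] * 6 + ["neg"] * 4
    always_majority = ["pos"] * 10  # a model that only echoes the majority class
    accuracy, baseline, p_value, significant = beats_majority_baseline(always_majority, gold)
    print(f"accuracy={accuracy:.2f} baseline={baseline:.2f} p={p_value:.3f} significant={significant}")
```

Under this sketch, a model that merely matches the majority baseline yields a large p-value and is not counted as an improvement, which is the sense in which a model can "rarely demonstrate statistically significant improvements" over that baseline.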

Task Contamination: Language Models May Not Be Few-Shot Anymore