October 2024

Contamination Report for Multilingual Benchmarks

Sanchit Ahuja, Varun Gumma, Sunayana Sitaram

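TL;DR: This study addresses benchmark contamination in the pre-training or post-training data of large language models (LLMs), which distorts evaluation results and obscures model capabilities. Using the Black Box test, we analyze contamination of $7$ popular multilingual benchmarks across $7$ well-known open and closed LLMs; almost all models show signs of contamination on the benchmarks tested. This finding can help the community identify the best benchmarks for multilingual evaluation.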

Abstract

Benchmark Contamination refers to the presence of test datasets in Large Language Model (LLM) pre-training or post-training data. Contamination can lead to inflated scores on benchmarks, compromising evaluation results and making it difficult to determine the capabilities of models. In