Test set contamination, wherein test data from a benchmark ends up in a newer
model's training set, is a well-documented obstacle for fair LLM evaluation and
can quickly render benchmarks obsolete. To mitigate this, many recent
benchmarks crowdsource new prompts and evaluations from human or LLM judges;
however, these can introduce significant biases, and break down when scoring
hard questions. In this work, we introduce a new benchmark for LLMs designed to
be immune to both test set contamination and the pitfalls of LLM judging and
human crowdsourcing. We release LiveBench, the first benchmark that (1)
contains frequently-updated questions from recent information sources, (2)
scores answers automatically according to objective ground-truth values, and
(3) contains a wide variety of challenging tasks, spanning math, coding,
reasoning, language, instruction following, and data analysis. To achieve this,
LiveBench contains questions that are based on recently-released math
competitions, arXiv papers, news articles, and datasets, and it contains
harder, contamination-free versions of tasks from previous benchmarks such as
Big-Bench Hard, AMPS, and IFEval. We evaluate many prominent closed-source
models, as well as dozens of open-source models ranging from 0.5B to 110B in
size. LiveBench is difficult, with top models achieving below 65% accuracy. We
release all questions, code, and model answers. Questions will be added and
updated on a monthly basis, and we will release new tasks and harder versions
of tasks over time so that LiveBench can distinguish between the capabilities
of LLMs as they improve in the future. We welcome community engagement and
collaboration for expanding the benchmark tasks and models.

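To address test set contamination and bias in evaluation, the work introduces LiveBench, a new benchmark that draws its questions from recent information sources and scores answers automatically against objective ground-truth values, and uses it to evaluate closed-source and open-source models of various sizes.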

LiveBench: A Challenging, Contamination-Free LLM Benchmark

Large language models are trained on vast amounts of internet data, prompting
concerns and speculation that they have memorized public benchmarks. Going from
speculation to proof of contamination is challenging, as the pretraining data
used by proprietary models are often not publicly accessible. We show that it
is possible to provide provable guarantees of test set contamination in
language models without access to pretraining data or model weights. Our
approach leverages the fact that when there is no data contamination, all
orderings of an exchangeable benchmark should be equally likely. In contrast,
the tendency for language models to memorize example order means that a
contaminated language model will find certain canonical orderings to be much
more likely than others. Our test flags potential contamination whenever the
likelihood of a canonically ordered benchmark dataset is significantly higher
than the likelihood after shuffling the examples. We demonstrate that our
procedure is sensitive enough to reliably prove test set contamination in
challenging situations, including models as small as 1.4 billion parameters, on
small test sets of only 1000 examples, and datasets that appear only a few
times in the pretraining corpus. Using our test, we audit five popular publicly
accessible language models for test set contamination and find little evidence
for pervasive contamination.

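Without access to pretraining data or model weights, the method provides provable guarantees of test set contamination in language models: by comparing the likelihood of a canonically ordered benchmark dataset with its likelihood after shuffling, the test can reliably detect contamination. Across five popular publicly accessible language models, the test finds little evidence of pervasive contamination.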
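
The ordering comparison described in the abstract above can be illustrated with a plain Monte Carlo permutation test. This is only a minimal sketch under stated assumptions, and it may differ from the authors' exact procedure: `permutation_p_value`, `make_memorizing_scorer`, and `order_blind_scorer` are hypothetical names, and the two toy scorers stand in for a real language model's log-likelihood of the concatenated benchmark.

```python
import random
from typing import Callable, Sequence


def permutation_p_value(
    examples: Sequence[str],
    log_likelihood: Callable[[Sequence[str]], float],
    num_shuffles: int = 200,
    seed: int = 0,
) -> float:
    """Estimate how unusual the canonical ordering's likelihood is.

    Under exchangeability (no contamination), the canonical ordering should
    score no better than a random shuffle, so small p-values are suspicious.
    """
    rng = random.Random(seed)
    canonical_score = log_likelihood(examples)
    at_least_as_high = 0
    for _ in range(num_shuffles):
        shuffled = list(examples)
        rng.shuffle(shuffled)
        if log_likelihood(shuffled) >= canonical_score:
            at_least_as_high += 1
    # The +1 terms count the canonical ordering itself, keeping the estimate
    # strictly positive and valid as a permutation-test p-value.
    return (1 + at_least_as_high) / (1 + num_shuffles)


def make_memorizing_scorer(canonical: Sequence[str]) -> Callable[[Sequence[str]], float]:
    """Toy stand-in for a contaminated model (assumption, not a real LM):
    it rewards every adjacent pair that appears in the canonical order."""
    seen_pairs = set(zip(canonical, canonical[1:]))

    def score(ordering: Sequence[str]) -> float:
        return float(sum((a, b) in seen_pairs for a, b in zip(ordering, ordering[1:])))

    return score


def order_blind_scorer(ordering: Sequence[str]) -> float:
    """Toy stand-in for an uncontaminated model: the score ignores order."""
    return -float(sum(len(example) for example in ordering))


examples = [f"question {i}" for i in range(20)]
p_contaminated = permutation_p_value(examples, make_memorizing_scorer(examples))
p_clean = permutation_p_value(examples, order_blind_scorer)
print(f"memorizing scorer p-value: {p_contaminated:.4f}")
print(f"order-blind scorer p-value: {p_clean:.4f}")
```

With a real model, `log_likelihood` would be replaced by the summed token log-probabilities the model assigns to the examples concatenated in the given order; the number of shuffles bounds the smallest p-value the test can report.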