Dynamic benchmarks interweave model fitting and data collection in an attempt
to mitigate the limitations of static benchmarks. In contrast to an extensive
theoretical and empirical study of the static setting, the dynamic counterpart
lags behind due to limited empirical studies and no apparent theoretical
foundation to date. Responding to this deficit, we initiate a theoretical study
of dynamic benchmarking. We examine two realizations, one capturing current
practice and the other modeling more complex settings. In the first model,
where data collection and model fitting alternate sequentially, we prove that
model performance improves initially but can stall after only three rounds.
Label noise arising from, for instance, annotator disagreement leads to even
stronger negative results. Our second model generalizes the first to the case
where data collection and model fitting have a hierarchical dependency
structure. We show that this design guarantees strictly more progress than the
first, albeit at a significant increase in complexity. We support our
theoretical analysis by simulating dynamic benchmarks on two popular datasets.
These results illuminate the benefits and practical limitations of dynamic
benchmarking, providing both a theoretical foundation and a causal explanation
for observed bottlenecks in empirical work.
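
The sequential setting described above, where model fitting and data collection alternate, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the paper's formal model: the one-dimensional task, the threshold learner `fit_stump`, the helper names, and the sample sizes are all hypothetical. Because this toy task is noise-free and realizable by the learner, it shows only the mechanics of the loop and does not reproduce the stalling behavior the abstract proves.

```python
import random


def true_label(x):
    # Hypothetical ground truth for the toy task: positive above 0.5.
    return int(x > 0.5)


def fit_stump(data):
    """Return the threshold t whose rule `x > t` has the lowest training error."""
    candidates = [0.0] + [x for x, _ in data]
    best_t, best_err = 0.0, float("inf")
    for t in candidates:
        err = sum(int(x > t) != y for x, y in data)
        if err < best_err:
            best_t, best_err = t, err
    return best_t


def error_rate(t, pool):
    """Fraction of the pool that the rule `x > t` labels incorrectly."""
    return sum(int(x > t) != true_label(x) for x in pool) / len(pool)


def run_dynamic_benchmark(rounds=3, n_init=5, n_per_round=5, seed=0):
    rng = random.Random(seed)
    # Finite pool standing in for the underlying data distribution.
    pool = [rng.random() for _ in range(2000)]
    train = [(x, true_label(x)) for x in rng.sample(pool, n_init)]
    history = []
    for _ in range(rounds + 1):
        # Model fitting step: train on all data gathered so far.
        t = fit_stump(train)
        history.append(error_rate(t, pool))
        # Data collection step: sample only points the current model gets wrong.
        failures = [x for x in pool if int(x > t) != true_label(x)]
        if not failures:
            break
        picked = rng.sample(failures, min(n_per_round, len(failures)))
        train += [(x, true_label(x)) for x in picked]
    return history


if __name__ == "__main__":
    for round_index, err in enumerate(run_dynamic_benchmark()):
        print(f"round {round_index}: error on pool = {err:.3f}")
```

Adding label noise to `true_label` during collection, or using a learner that cannot represent the target, would be the natural way to probe the less favorable regimes the abstract refers to.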

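This work presents a theoretical analysis of two realizations of dynamic benchmarking. In the first model, performance improves initially but can stall after only three rounds; the second model guarantees more progress than the first, at the cost of higher complexity. Simulations of dynamic benchmarks support the theoretical analysis, giving dynamic benchmarking both theoretical and practical grounding.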

A Theory of Dynamic Benchmarks

Machine learning models are often brittle on production data despite
achieving high accuracy on benchmark datasets. Benchmark datasets have
traditionally served dual purposes: first, benchmarks offer a standard on which
machine learning researchers can compare different methods, and second,
benchmarks provide a model, albeit imperfect, of the real world. The
incompleteness of test benchmarks (and the data upon which models are trained)
hinders robustness in machine learning, enables shortcut learning, and leaves
models systematically prone to err on out-of-distribution and adversarially
perturbed data. The mismatch between a single static benchmark dataset and a
production dataset has traditionally been described as a dataset shift. In an
effort to clarify how to address the mismatch between test benchmarks and
production data, we introduce context shift to describe semantically meaningful
changes in the underlying data generation process. Moreover, we identify three
methods for addressing context shift that would otherwise lead to model
prediction errors: first, we describe how human intuition and expert knowledge
can identify semantically meaningful features upon which models systematically
fail; second, we detail how dynamic benchmarking, with its focus on capturing
the data generation process, can promote generalizability through
corroboration; and third, we highlight that clarifying a model's limitations
can reduce unexpected errors. Robust machine learning is focused on model
performance beyond benchmarks, and as such, we consider three model organism
domains - facial expression recognition, deepfake detection, and medical
diagnosis - to highlight how implicit assumptions in benchmark tasks lead to
errors in practice. By paying close attention to the role of context,
researchers can design more comprehensive benchmarks, reduce context shift
errors, and increase generalizability.

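The study examines the brittleness of machine learning models on production data and introduces the concept of context shift. It discusses three ways to address context shift: drawing on human intuition and expert knowledge, using dynamic benchmarking to improve generalizability, and making a model's limitations more transparent. It also uses three domains (facial expression recognition, deepfake detection, and medical diagnosis) to examine how implicit assumptions in benchmark tasks lead to model errors.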

Identifying the Context Shift between Test Benchmarks and Production Data

We introduce Dynabench, an open-source platform for dynamic dataset creation
and model benchmarking. Dynabench runs in a web browser and supports
human-and-model-in-the-loop dataset creation: annotators seek to create
examples that a target model will misclassify, but that another person will
not. In this paper, we argue that Dynabench addresses a critical need in our
community: contemporary models quickly achieve outstanding performance on
benchmark tasks but nonetheless fail on simple challenge examples and falter in
real-world scenarios. With Dynabench, dataset creation, model development, and
model assessment can directly inform each other, leading to more robust and
informative benchmarks. We report on four initial NLP tasks, illustrating these
concepts and highlighting the promise of the platform, and address potential
objections to dynamic benchmarking as a new standard for the field.
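
The collection criterion described above, keeping examples that the target model misclassifies but that another person does not, can be sketched as a simple filter. This is a hypothetical illustration and not Dynabench's actual API: the `Example` fields, the function names, and the toy `keyword_model` are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    text: str
    annotator_label: str  # label intended by the person who wrote the example
    validator_label: str  # label assigned independently by another person


def is_model_fooling(ex: Example, model_predict: Callable[[str], str]) -> bool:
    """True when the model is wrong and a second person agrees with the annotator."""
    model_wrong = model_predict(ex.text) != ex.annotator_label
    human_agrees = ex.validator_label == ex.annotator_label
    return model_wrong and human_agrees


def collect_round(candidates: List[Example],
                  model_predict: Callable[[str], str]) -> List[Example]:
    """Keep only the validated examples that fool the target model."""
    return [ex for ex in candidates if is_model_fooling(ex, model_predict)]


def keyword_model(text: str) -> str:
    # Toy stand-in for a target model: any "not" is read as negative sentiment.
    return "negative" if "not" in text.lower() else "positive"


candidates = [
    Example("Not bad at all, I loved it", "positive", "positive"),
    Example("A great film", "positive", "positive"),
    Example("It was fine, I guess", "negative", "positive"),
]

kept = collect_round(candidates, keyword_model)
for ex in kept:
    print(ex.text)
```

In this sketch the second candidate is dropped because the toy model classifies it correctly, and the third is dropped because the validator disagrees with the annotator, so the model's "error" there is not confirmed by a person.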

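Dynabench is an open-source platform for dynamic dataset creation and model benchmarking that runs in a web browser. With humans and models in the loop, annotators create examples that the target model misclassifies but that another person does not. The paper argues that Dynabench addresses the problem that current models perform very well on benchmark tasks yet fail on simple challenge examples and in real-world scenarios. It reports on four initial NLP tasks to illustrate these concepts, highlights the promise of the platform, and addresses potential objections to dynamic benchmarking as a new standard.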