Adversarial benchmarks validate model abilities by providing samples that
fool models but not humans. However, despite the proliferation of datasets that
claim to be adversarial, there does not exist an established metric to evaluate
how adversarial these datasets are. To address this lacuna, we introduce
ADVSCORE, a metric that quantifies how adversarial and discriminative an
adversarial dataset is and exposes the features that make data adversarial. We
then use ADVSCORE to underpin a dataset creation pipeline that incentivizes
writing a high-quality adversarial dataset. As a proof of concept, we use
ADVSCORE to collect an adversarial question answering (QA) dataset, ADVQA, from
our pipeline. The high-quality questions in ADVQA surpass three adversarial
benchmarks across domains at fooling several models but not humans. We validate
this result using difficulty estimates from 9,347 human responses on four
datasets and predictions from three models. Moreover, ADVSCORE uncovers which
adversarial tactics used by human writers fool models (e.g., GPT-4) but not
humans. Through ADVSCORE and its analyses, we offer guidance on revealing
language model vulnerabilities and producing reliable adversarial examples.

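ADVSCORE quantifies and exposes the adversarial features of a dataset. It is also used to evaluate high-quality adversarial datasets, verifying that they fool models but not humans, and to reveal the adversarial tactics that human writers use to fool models but not humans, thereby offering guidance on revealing language model weaknesses and producing reliable adversarial examples.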