Adversarial benchmarks validate model abilities by providing samples that
fool models but not humans. However, despite the proliferation of datasets that
claim to be adversarial, no established metric exists for evaluating
how adversarial these datasets are. To address this lacuna, we introduce
ADVSCORE, a metric that quantifies how adversarial and discriminative an
adversarial dataset is and exposes the features that make data adversarial. We
then use ADVSCORE to underpin a dataset creation pipeline that incentivizes
writing a high-quality adversarial dataset. As a proof of concept, we use
ADVSCORE to collect an adversarial question answering (QA) dataset, ADVQA, from
our pipeline. The high-quality questions in ADVQA surpass three adversarial
benchmarks across domains at fooling several models but not humans. We validate
our result based on difficulty estimates from 9,347 human responses on four
datasets and predictions from three models. Moreover, ADVSCORE uncovers which
adversarial tactics used by human writers fool models (e.g., GPT-4) but not
humans. Through ADVSCORE and its analyses, we offer guidance on revealing
language model vulnerabilities and producing reliable adversarial examples.
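
The abstract does not state the ADVSCORE formula, so the following is not ADVSCORE. It is only a minimal sketch of the underlying intuition, "fools models but not humans", computed as the gap between human and model accuracy on a single item. The function names, the item names, and the response data are all hypothetical.

```python
def accuracy(responses):
    """Fraction of correct responses; `responses` is a non-empty list of booleans."""
    return sum(responses) / len(responses)


def human_model_gap(human_responses, model_responses):
    """Human accuracy minus model accuracy on one item.

    Assumption: this simple difference is an illustrative proxy only,
    not the metric defined in the paper.
    """
    return accuracy(human_responses) - accuracy(model_responses)


# Hypothetical response logs: True means the question was answered correctly.
items = {
    "q1": {"humans": [True, True, True, False], "models": [False, False, False]},
    "q2": {"humans": [True, False, False, False], "models": [True, True, False]},
}

for name, r in items.items():
    print(name, round(human_model_gap(r["humans"], r["models"]), 3))
```

A large positive gap (as for the hypothetical q1) marks an item that most humans answer correctly while the models fail; a negative gap (q2) marks an item that is harder for humans than for models, which is not adversarial in the sense used above.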

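ADVSCORE quantifies and exposes the adversarial features of a dataset. It is also used to evaluate high-quality adversarial datasets, validating their ability to fool models but not humans, and to reveal the adversarial tactics human writers use to do so, thereby guiding the discovery of language model vulnerabilities and the production of reliable adversarial examples.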

ADVSCORE: A Metric for the Evaluation and Creation of Adversarial Benchmarks

We present a novel framework for generating adversarial benchmarks to
evaluate the robustness of image classification models. Our framework allows
users to customize the types of distortions to be optimally applied to images,
which helps address the specific distortions relevant to their deployment. The
benchmark can generate datasets at various distortion levels to assess the
robustness of different image classifiers. Our results show that
adversarial samples generated by our framework with any of the image
classification models, such as ResNet-50, Inception-V3, and VGG-16, are effective
and transferable to other models, causing them to fail. These failures happen
even when these models are adversarially retrained using state-of-the-art
techniques, demonstrating the generalizability of our adversarial samples. We
achieve competitive performance in terms of net $L_2$ distortion compared to
state-of-the-art benchmark techniques on CIFAR-10 and ImageNet; however, we
demonstrate that our framework achieves these results with simple distortions like
Gaussian noise without introducing unnatural artifacts or color bleeds. This is
made possible by a model-based reinforcement learning (RL) agent and a
technique that reduces a deep tree search of the image for model sensitivity to
perturbations to a one-level analysis and action. The flexibility of choosing
distortions and setting classification probability thresholds for multiple
classes makes our framework suitable for algorithmic audits.
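
To make the distortion measure concrete, here is a minimal sketch of adding Gaussian noise to an image and computing the resulting $L_2$ distortion, i.e. the Euclidean norm of the pixel-wise difference. It does not reproduce the RL agent or the search procedure described above; the function names, the toy image, and the noise levels are assumptions chosen only for illustration.

```python
import math
import random


def add_gaussian_noise(pixels, sigma, seed=0):
    """Return a copy of `pixels` (floats in [0, 1]) with Gaussian noise added.

    Values are clipped back into [0, 1] so the result is still a valid image.
    """
    rng = random.Random(seed)
    return [min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in pixels]


def l2_distortion(original, distorted):
    """Euclidean (L2) norm of the difference between two flattened images."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(original, distorted)))


# Hypothetical 4-pixel "image", flattened to a list of intensities.
image = [0.2, 0.5, 0.8, 0.4]

for sigma in (0.01, 0.05, 0.1):
    noisy = add_gaussian_noise(image, sigma)
    print(f"sigma={sigma}: L2 distortion = {l2_distortion(image, noisy):.4f}")
```

Sweeping `sigma` in this way mirrors, in a very reduced form, the idea of generating datasets at several distortion levels; a real audit would apply the noise to full images and record each classifier's predictions at every level.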

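We propose a novel framework for generating adversarial benchmarks that evaluate the robustness of image classification models. The framework lets users customize the types of distortions applied to images, so that they can target the distortions relevant to their deployment. The benchmark can generate datasets at different distortion levels to assess the robustness of different image classifiers. Our results show that adversarial samples generated by the framework with any image classification model (such as ResNet-50, Inception-V3, and VGG-16) are effective and transferable, causing other models to fail. These failures persist even when the models are adversarially retrained with state-of-the-art techniques, which demonstrates the generalizability of our adversarial samples. In terms of net $L_2$ distortion, we achieve performance competitive with state-of-the-art benchmark techniques on CIFAR-10 and ImageNet; however, our framework achieves these results with simple distortions such as Gaussian noise, without introducing unnatural artifacts or color bleeds. This is made possible by a model-based reinforcement learning (RL) agent and a technique that reduces a deep tree search of the image for model sensitivity to perturbations to a one-level analysis and action. The flexibility of choosing distortion types and setting classification probability thresholds for multiple classes makes the framework suitable for algorithmic audits.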