Explainable AI (XAI) methods aim to describe the decision process of deep
neural networks. Early XAI methods produced visual explanations, whereas more
recent techniques generate multimodal explanations that include textual
information and visual representations. Visual XAI methods have been shown to
be vulnerable to white-box and gray-box adversarial attacks, with an attacker
having full or partial knowledge of and access to the target system. As the
vulnerabilities of multimodal XAI models have not been examined, in this paper
we assess for the first time the robustness to black-box attacks of the natural
language explanations generated by a self-rationalizing image-based activity
recognition model. We generate unrestricted, spatially variant perturbations
that disrupt the association between the predictions and the corresponding
explanations to mislead the model into generating unfaithful explanations. We
show that we can create adversarial images that manipulate the explanations of
an activity recognition model by having access only to its final output.
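
A black-box attack of this kind can be illustrated with a minimal sketch. It assumes a plain random search rather than the perturbation method of the paper, and `toy_model`, `explanation_attack`, and all parameters are hypothetical names introduced only for illustration. The attacker sees only the final output (label and explanation) and looks for an image that keeps the label but changes the explanation.

```python
import random


def toy_model(image):
    """Hypothetical stand-in for a self-rationalizing model (not the paper's)."""
    label = "bright" if sum(image) / len(image) > 0.5 else "dark"
    explanation = "first pixel is light" if image[0] > 0.5 else "first pixel is dim"
    return label, explanation


def explanation_attack(model, image, n_queries=200, step=0.1, seed=0):
    """Random search using only the model's output (assumed strategy)."""
    rng = random.Random(seed)
    label, explanation = model(image)
    for _ in range(n_queries):
        # Each pixel receives its own random offset, clipped to [0, 1].
        candidate = [min(1.0, max(0.0, p + rng.uniform(-step, step))) for p in image]
        new_label, new_explanation = model(candidate)
        # Success: the prediction is unchanged but the explanation differs.
        if new_label == label and new_explanation != explanation:
            return candidate
    return None  # no adversarial image found within the query budget


clean = [0.52, 0.9, 0.9, 0.9]  # image as a flat list of pixel intensities
adversarial = explanation_attack(toy_model, clean)
```

The success criterion, same prediction with a different explanation, is what separates the prediction from its explanation; a real attack would replace the exact string comparison with a text similarity measure.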

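Explainable AI (XAI) methods aim to describe the decision process of deep neural networks. This paper is the first to assess the robustness to black-box attacks of the natural language explanations generated by a self-rationalizing image recognition model. We generate unrestricted, spatially variant perturbations that disrupt the association between the predictions and the corresponding explanations, misleading the model into generating unfaithful explanations. We show that, even with access only to the final output of the model, we can create adversarial images that manipulate the explanations of an activity recognition model.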

Black-box Attacks on Image Activity Prediction and its Natural Language Explanations

Neural text detectors aim to identify the characteristics that distinguish
neural (machine-generated) texts from human-written ones. To challenge such
detectors, adversarial attacks can alter the statistical characteristics of the
generated text, making the detection task increasingly difficult. Inspired by the
advances of mutation analysis in software development and testing, in this
paper, we propose character- and word-based mutation operators for generating
adversarial samples to attack state-of-the-art natural text detectors. This
falls under white-box adversarial attacks. In such attacks, attackers have
access to the original text and create mutation instances based on this
original text. The ultimate goal is to confuse machine learning models and
classifiers and decrease their prediction accuracy.
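
The idea of character- and word-based mutation operators can be sketched as follows. The two operators shown (homoglyph substitution and synonym replacement) and the lookup tables `HOMOGLYPHS` and `SYNONYMS` are assumptions chosen for illustration; the abstract does not specify which operators the paper uses.

```python
import random

# Hypothetical lookup tables, for illustration only.
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e"}  # Latin -> Cyrillic look-alikes
SYNONYMS = {"big": "large", "quick": "fast"}


def mutate_characters(text, rate=1.0, seed=0):
    """Character-based operator: swap some characters for look-alikes."""
    rng = random.Random(seed)
    return "".join(
        HOMOGLYPHS[c] if c in HOMOGLYPHS and rng.random() < rate else c
        for c in text
    )


def mutate_words(text):
    """Word-based operator: replace whitespace-separated words by a listed synonym."""
    return " ".join(SYNONYMS.get(word, word) for word in text.split())


original = "a big quick dog"
word_mutant = mutate_words(original)
char_mutant = mutate_characters("one")
```

Each mutant stays readable to a human while its character or token statistics change, which is the property such an attack relies on. Note that `mutate_words` normalizes whitespace because it splits and rejoins on spaces.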

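This paper proposes character- and word-based mutation operators for generating adversarial samples to attack state-of-the-art natural text detectors, thereby gradually reducing the prediction accuracy of machine learning models and classifiers.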

Mutation-Based Adversarial Attacks on Neural Text Detectors

We aim to demonstrate the influence of diversity in an ensemble of CNNs on
the detection of black-box adversarial instances and on making white-box
adversarial attacks harder to generate. To this end, we propose an ensemble of
diverse specialized CNNs along with a simple voting mechanism. The diversity in
this ensemble creates a gap between the predictive confidences of adversaries
and those of clean samples, making adversaries detectable. We then analyze how
diversity in such an ensemble of specialists may mitigate the risk of
black-box and white-box adversarial examples. Using MNIST and CIFAR-10, we
empirically verify the ability of our ensemble to detect a large portion of
well-known black-box adversarial examples, which leads to a significant
reduction in the risk rate of adversaries, at the expense of a small increase
in the risk rate of clean samples. Moreover, we show that the success rate of
white-box attacks generated against our ensemble is markedly lower than against
a vanilla CNN and an ensemble of vanilla CNNs, highlighting the beneficial
role of diversity in the ensemble for developing more robust models.
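
The detection principle, a confidence gap between clean and adversarial inputs, can be illustrated with a minimal sketch. It assumes simple averaging of member probabilities and a fixed rejection threshold; the function `ensemble_predict` and the value of `threshold` are hypothetical, and the specialization of the individual CNNs described in the paper is not reproduced here.

```python
def ensemble_predict(member_probs, threshold=0.5):
    """Average the members' class probabilities and flag low-confidence inputs.

    member_probs: one probability vector per ensemble member (assumed given).
    Returns (label, confidence, flagged), where flagged marks a suspected adversary.
    """
    n_members = len(member_probs)
    n_classes = len(member_probs[0])
    # Simple voting: average each class probability over all members.
    avg = [sum(p[k] for p in member_probs) / n_members for k in range(n_classes)]
    label = max(range(n_classes), key=avg.__getitem__)
    confidence = avg[label]
    return label, confidence, confidence < threshold


# Members agree on a clean-looking input: confidence stays high.
clean_result = ensemble_predict([[0.9, 0.1], [0.8, 0.2], [0.85, 0.15]])

# Diverse members disagree on a suspicious input: confidence collapses.
suspect_result = ensemble_predict([[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]])
```

When diverse members disagree, the averaged confidence falls below the threshold and the input is flagged. The cost is that some clean samples may also be flagged, which matches the trade-off stated above.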

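This paper studies how an ensemble of diverse specialized CNNs affects the detection of black-box adversarial instances and makes white-box adversarial attacks harder to generate. It shows how the diversity of such an ensemble of specialists can mitigate the risk of black-box and white-box adversarial examples. Experiments on MNIST and CIFAR-10 show that the ensemble detects most well-known black-box adversarial instances, which significantly reduces the risk rate of adversaries at the cost of a modest increase in the risk rate of clean samples. Moreover, compared to a vanilla CNN and an ensemble of vanilla CNNs, the success rate of generating white-box attacks against the ensemble drops significantly, highlighting the beneficial role of diversity in the ensemble for developing more robust models.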