The great success of Transformer-based models stems from the powerful
multi-head self-attention mechanism, which learns token dependencies and
encodes contextual information from the input. Prior work attributes model
decisions to individual input features using various saliency measures, but it
fails to explain how these input features interact with each other to reach
predictions. In this paper, we propose a self-attention attribution
method to interpret the information interactions inside Transformer. We take
BERT as an example and conduct extensive studies. First, we apply
self-attention attribution to identify the important attention heads; the
others can be pruned with only marginal performance degradation. Furthermore, we
extract the most salient dependencies in each layer to construct an attribution
tree, which reveals the hierarchical interactions inside Transformer. Finally,
we show that the attribution results can be used as adversarial patterns to
implement non-targeted attacks against BERT.
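
The abstract does not spell out how the attribution is computed. As a rough
illustration only, the sketch below assumes an integrated-gradients-style
score: each attention entry is multiplied by its gradient averaged along the
path from zero attention to the observed attention. The toy model, its
weights, and all function names are hypothetical stand-ins, not the authors'
implementation.

```python
import math

def toy_model(attn, weights):
    """Hypothetical stand-in for the model output F(A): tanh of a weighted sum."""
    s = sum(w * a for w_row, a_row in zip(weights, attn)
            for w, a in zip(w_row, a_row))
    return math.tanh(s)

def toy_gradient(attn, weights):
    """Analytic gradient dF/dA of toy_model for every attention entry."""
    f = toy_model(attn, weights)
    return [[(1.0 - f * f) * w for w in w_row] for w_row in weights]

def attention_attribution(attn, weights, steps=200):
    """Approximate A * integral_0^1 dF(alpha * A)/dA d(alpha) (midpoint rule)."""
    n_rows, n_cols = len(attn), len(attn[0])
    avg_grad = [[0.0] * n_cols for _ in range(n_rows)]
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval
        scaled = [[alpha * a for a in row] for row in attn]
        grad = toy_gradient(scaled, weights)
        for i in range(n_rows):
            for j in range(n_cols):
                avg_grad[i][j] += grad[i][j] / steps
    return [[attn[i][j] * avg_grad[i][j] for j in range(n_cols)]
            for i in range(n_rows)]

# Toy attention matrix for one head (each row sums to 1).
attn = [[0.7, 0.3], [0.4, 0.6]]
# Hypothetical sensitivity of the output to each attention entry.
weights = [[0.5, -0.2], [0.1, 0.8]]
attr = attention_attribution(attn, weights)
```

Under this assumption the scores sum to approximately F(A) - F(0), so each
entry can be read as the share of the output carried by one token-to-token
connection. Entries with large scores would be the candidates for the salient
dependencies mentioned above.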

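This paper proposes a self-attention attribution method. Extensive studies on BERT show that the method can identify important attention heads, construct attribution trees that reveal the hierarchical interactions inside Transformer, and supply adversarial patterns for non-targeted attacks.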

Self-Attention Attribution: Interpreting Information Interactions Inside Transformer

Patch-based attacks introduce a perceptible but localized change to the input
that induces misclassification. A limitation of current patch-based black-box
attacks is that they perform poorly for targeted attacks, and even for the less
challenging non-targeted scenarios, they require a large number of queries. Our
proposed PatchAttack is query-efficient and can break models for both targeted
and non-targeted attacks. PatchAttack induces misclassifications by
superimposing small textured patches on the input image. We parametrize the
appearance of these patches by a dictionary of class-specific textures. This
texture dictionary is learned by clustering Gram matrices of feature
activations from a VGG backbone. PatchAttack optimizes the position and texture
parameters of each patch using reinforcement learning. Our experiments show
that PatchAttack achieves a success rate above 99% on ImageNet for a wide range of
architectures, while only manipulating 3% of the image for non-targeted attacks
and 10% on average for targeted attacks. Furthermore, we show that PatchAttack
circumvents state-of-the-art adversarial defense methods successfully.
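
To make the texture descriptor concrete, the sketch below computes a Gram
matrix from a small feature map. It is a minimal illustration: the feature
values are made up rather than taken from a VGG backbone, the normalization by
the number of spatial positions is an assumption, and `gram_matrix` is a
hypothetical helper name.

```python
def gram_matrix(features):
    """Gram matrix of a feature map.

    `features` is a list of C channels, each a flat list of N activations
    (the spatial positions of one channel, flattened). Entry (i, j) is the
    inner product of channels i and j, divided by N (assumed normalization).
    """
    n = len(features[0])
    return [[sum(a * b for a, b in zip(ci, cj)) / n for cj in features]
            for ci in features]

# Made-up activations: 3 channels, 4 spatial positions each.
features = [
    [1.0, 0.0, 2.0, 1.0],
    [0.0, 1.0, 1.0, 0.0],
    [2.0, 2.0, 0.0, 1.0],
]
gram = gram_matrix(features)
```

The resulting C x C matrix records which channels fire together while
discarding where they fire, which is why it serves as a texture descriptor.
Flattened Gram matrices from many images of one class could then be passed to
a clustering routine to obtain the class-specific textures.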

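PatchAttack is an effective adversarial attack on images, based on a texture dictionary and reinforcement learning. It superimposes small textured patches on the image to induce misclassification, changing only 3% of the image for non-targeted attacks and 10% on average for targeted attacks.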