Robust explanations of machine learning models are critical to establish
human trust in the models. Because of limited cognitive capacity, most people
can interpret only the top few salient features. It is therefore critical to
make the top salient features robust to adversarial attacks, especially those
against the more vulnerable gradient-based explanations. Existing defenses
measure robustness using $\ell_p$-norms, which offer weaker protection. We
define explanation thickness to measure the ranking stability of salient
features, and derive tractable surrogate bounds on the thickness. These bounds
underpin the \textit{R2ET} algorithm, which efficiently maximizes the thickness
and anchors the top salient features. Theoretically, we prove a connection
between R2ET and adversarial
training. Experiments with a wide spectrum of network architectures and data
modalities, including brain networks, demonstrate that R2ET attains higher
explanation robustness under stealthy attacks while retaining accuracy.
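
The idea of ranking stability for top salient features can be illustrated with a small sketch. The code below is not the R2ET algorithm or the paper's definition of thickness. It assumes a toy two-layer tanh network with random weights, uses the input gradient as the explanation, and uses the gap between the k-th and (k+1)-th largest absolute saliency as a simplified stand-in for how far the scores must move before the top-k set changes. All function names are hypothetical.

```python
import numpy as np

def saliency(x, W1, b1, w2):
    """Gradient of the toy model f(x) = w2 . tanh(W1 x + b1) with respect to x."""
    h = np.tanh(W1 @ x + b1)
    return W1.T @ (w2 * (1.0 - h ** 2))

def top_k(scores, k):
    """Indices of the k features with the largest absolute saliency."""
    return set(np.argsort(-np.abs(scores))[:k].tolist())

def ranking_gap(scores, k):
    """Gap between the k-th and (k+1)-th largest absolute saliency.

    Illustrative stand-in only (an assumption, not the paper's thickness):
    a larger gap means the scores must move further before the top-k changes.
    """
    s = np.sort(np.abs(scores))[::-1]
    return float(s[k - 1] - s[k])

def top_k_overlap(x, x_adv, k, params):
    """Fraction of top-k features shared by the original and perturbed input."""
    a = top_k(saliency(x, *params), k)
    b = top_k(saliency(x_adv, *params), k)
    return len(a & b) / k

rng = np.random.default_rng(0)
params = (rng.normal(size=(8, 5)), rng.normal(size=8), rng.normal(size=8))
x = rng.normal(size=5)
x_adv = x + 0.05 * rng.normal(size=5)  # small random perturbation, not a real attack

print("gap:", ranking_gap(saliency(x, *params), k=2))
print("top-2 overlap:", top_k_overlap(x, x_adv, 2, params))
```

A training-time defense in this spirit would try to enlarge such gaps. An $\ell_p$-based measure would instead compare the two saliency vectors directly, without regard to which features end up in the top k.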

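Robust explanations of machine learning models are critical for establishing human trust in the models. This work proposes explanation thickness to measure the ranking stability of salient features, and designs the R2ET algorithm to maximize the thickness and so protect vulnerable gradient-based explanations. Experiments show that R2ET attains higher explanation robustness under stealthy attacks while retaining accuracy.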

Robust Ranking Explanations

We propose a margin-based loss for vision-language model pretraining that
encourages gradient-based explanations that are consistent with region-level
annotations. We refer to this objective as Attention Mask Consistency (AMC) and
demonstrate that it produces superior visual grounding performance compared to
models that rely instead on region-level annotations for explicitly training an
object detector such as Faster R-CNN. AMC works by encouraging gradient-based
explanation masks that focus their attention scores mostly within annotated
regions of interest for images that contain such annotations. In particular, a
model trained with AMC on top of standard vision-language modeling objectives
obtains a state-of-the-art accuracy of 86.59% on the Flickr30k visual grounding
benchmark, an absolute improvement of 5.48% over the best previous
model. Our approach also performs exceedingly well on established benchmarks
for referring expression comprehension and offers, by design, the added benefit
of gradient-based explanations that better align with human annotations.
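
A margin-based objective of this kind can be sketched as a hinge penalty on an explanation heatmap and a binary region mask. The abstract does not give the exact loss terms, so the formulation below is an assumption: it penalizes the heatmap unless both its mean and its peak inside the annotated region exceed those outside by a margin. The function name and margin values are hypothetical, and the code uses NumPy for readability, whereas actual training would need a differentiable framework.

```python
import numpy as np

def margin_consistency_loss(heatmap, mask, margin_mean=0.1, margin_max=0.5):
    """Hinge-style penalty that is zero when attention concentrates inside `mask`.

    heatmap: non-negative explanation scores, shape (H, W)
    mask:    binary region annotation, shape (H, W), 1 inside the region;
             assumed to contain both inside and outside pixels
    The two margins are hypothetical values chosen for illustration.
    """
    inside = heatmap[mask == 1]
    outside = heatmap[mask == 0]
    # Mean attention inside the region should exceed the mean outside by a margin.
    loss_mean = max(0.0, outside.mean() - inside.mean() + margin_mean)
    # The peak inside the region should exceed the peak outside by a margin.
    loss_max = max(0.0, outside.max() - inside.max() + margin_max)
    return float(loss_mean + loss_max)

mask = np.zeros((4, 4), dtype=int)
mask[1:3, 1:3] = 1                      # annotated region of interest

good = np.full((4, 4), 0.05)
good[1:3, 1:3] = 0.9                    # attention focused inside the region
bad = 1.0 - good                        # attention focused outside the region

print("focused inside: ", margin_consistency_loss(good, mask))
print("focused outside:", margin_consistency_loss(bad, mask))
```

The heatmap focused inside the region incurs no penalty, while the inverted heatmap is penalized on both terms.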

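Attention Mask Consistency is a margin-based loss used in vision-language model pretraining to make gradient-based explanations consistent with region-level annotations. It yields better visual grounding performance than models that rely on region-level annotations to explicitly train an object detector.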

Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations

Recent developments in machine learning have introduced models that approach
human performance at the cost of increased architectural complexity. Efforts to
make the rationales behind the models' predictions transparent have inspired an
abundance of new explainability techniques. Given an already trained model,
these techniques compute saliency scores for the words of an input instance.
However, there exists no definitive guide on (i) how to choose such a technique
given a particular application task and model architecture, and (ii) the
benefits and drawbacks of using each such technique. In this paper, we develop
a comprehensive list of diagnostic properties for evaluating existing
explainability techniques. We then employ the proposed list to compare a set of
diverse explainability techniques on downstream text classification tasks and
neural network architectures. We also compare the saliency scores assigned by
the explainability techniques with human annotations of salient input regions
to find relations between a model's performance and the agreement of its
rationales with human ones. Overall, we find that gradient-based explanations
perform best across tasks and model architectures, and we present
further insights into the properties of the reviewed explainability techniques.
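
Comparing saliency scores with human annotations requires some agreement measure. The abstract does not say which one is used, so the sketch below assumes one common choice: the average precision of the saliency ranking against a binary mask of human-marked words. The function name, tokens, and scores are made up for illustration.

```python
import numpy as np

def average_precision(saliency, human_mask):
    """Average precision of the saliency ranking against binary human labels."""
    order = np.argsort(-np.asarray(saliency, dtype=float), kind="stable")
    hits = np.asarray(human_mask)[order]
    if hits.sum() == 0:
        return 0.0  # no human-marked words, so agreement is undefined; report 0
    precision_at_rank = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    # Average the precision over the ranks where a human-marked word appears.
    return float((precision_at_rank * hits).sum() / hits.sum())

tokens = ["the", "movie", "was", "terrible"]
human = [0, 0, 0, 1]                    # annotator marked "terrible" as salient

technique_a = [0.10, 0.30, 0.05, 0.90]  # ranks "terrible" first
technique_b = [0.90, 0.30, 0.05, 0.10]  # ranks "terrible" third

print("A:", average_precision(technique_a, human))
print("B:", average_precision(technique_b, human))
```

Averaging this score over a dataset gives one number per explainability technique, which can then be set against the model's task performance.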

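This paper evaluates different explainability techniques across neural network architectures and text classification tasks, and finds that gradient-based explanations perform best across tasks and architectures.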

A Diagnostic Study of Explainability Techniques for Text Classification

We show through theory and experiment that gradient-based explanations of a
model quickly reveal the model itself. Our results speak to a tension between
the desire to keep a proprietary model secret and the ability to offer model
explanations. On the theoretical side, we give an algorithm that provably
learns a two-layer ReLU network in a setting where the algorithm may query the
gradient of the model with respect to chosen inputs. The number of queries is
independent of the dimension and nearly optimal in its dependence on the model
size. Of interest not only from a learning-theoretic perspective, this result
highlights the power of gradients rather than labels as a learning primitive.
Complementing our theory, we give effective heuristics for reconstructing
models from gradient explanations that are orders of magnitude more
query-efficient than reconstruction attacks relying on prediction interfaces.
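
A toy sketch can show why gradient access leaks so much about a ReLU network. This is not the paper's algorithm and makes no claim to its query efficiency. It assumes a hand-picked two-layer network f(x) = sum_i a[i] * relu(W[i] . x + b[i]) and an attacker who can only query gradients. Because the gradient is piecewise constant, each jump observed along a ray equals a[i] * W[i] for the unit whose activation flips. All names and weights are hypothetical.

```python
import numpy as np

# Hypothetical "secret" network: f(x) = sum_i a[i] * relu(W[i] . x + b[i]).
W = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b = np.array([-1.0, -2.0, -4.5])
a = np.array([2.0, -1.0, 0.5])

def grad_query(x):
    """The attacker's only access: the gradient of f at a chosen input."""
    active = (W @ x + b) > 0
    return (a * active) @ W

def recover_units(x0, direction, t_max=5.0, steps=500):
    """Collect gradient jumps along the ray x0 + t * direction.

    Each jump is the contribution a[i] * W[i] of one hidden unit switching
    on (positive sign) or off (negative sign) as the ray crosses its boundary.
    """
    jumps = []
    prev = grad_query(x0)
    for t in np.linspace(0.0, t_max, steps + 1)[1:]:
        cur = grad_query(x0 + t * direction)
        if not np.allclose(cur, prev):
            jumps.append(cur - prev)
        prev = cur
    return jumps

recovered = recover_units(np.zeros(2), np.array([1.0, 1.0]))
for jump in recovered:
    print(jump)
```

The uniform scan spends many queries for clarity. A bisection search for each jump would need far fewer, and the positions of the jumps additionally constrain the biases. Recovering the same information from prediction queries alone would require fitting the function values, which is the contrast the abstract draws.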

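Through theory and experiment, this work shows that gradient-based model explanations quickly reveal the model itself, a result that highlights the power of gradients rather than labels as a learning primitive. The work also gives effective heuristics for reconstructing models from gradient explanations.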