Robust explanations of machine learning models are critical to establishing human trust in the models. Because of limited cognitive capacity, most humans can interpret only the top few salient features. It is therefore critical to make the top salient features robust to adversarial attacks, especially attacks against the more vulnerable gradient-based explanations. Existing defenses measure robustness using $\ell_p$-norms, which offer weaker protection. We define explanation thickness to measure the ranking stability of salient features, and derive tractable surrogate bounds on the thickness to design the \textit{R2ET} algorithm, which efficiently maximizes the thickness and anchors the top salient features. Theoretically, we prove a connection between R2ET and adversarial training. Experiments across a wide spectrum of network architectures and data modalities, including brain networks, demonstrate that R2ET attains higher explanation robustness under stealthy attacks while retaining accuracy.

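Robust explanations of machine learning models are critical to establishing human trust in the models. This work proposes explanation thickness to measure the ranking stability of salient features, and designs the R2ET algorithm to maximize the thickness and thereby protect the vulnerable gradient-based explanations. Experiments show that R2ET attains higher explanation robustness under stealthy attacks while retaining accuracy.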

Robust Ranking Explanations