One cornerstone of interpretable deep learning is the high degree of visual alignment that input-gradients, i.e.,the gradients of the output w.r.t. inputs, exhibit with the input data. This alignment is assumed to arise as a result of the model's generalization, justifying its use for interpretability. However, recent work has shown that it is possible to 'fool' models into having arbitrary gradients while achieving good generalization, thus falsifying the assumption above. This leaves an open question: if not generalization, what causes input-gradients to align with input data? In this work, we first show that it is simple to 'fool' input-gradients using the shift-invariance property of softmax, and that gradient structure is unrelated to model generalization. Second, we re-interpret the logits of standard classifiers as unnormalized log-densities of the data distribution, and find that we can improve this gradient alignment via a generative modelling objective called score-matching.To show this, we derive a novel approximation to the score-matching objective that eliminates the need for expensive Hessian computations, which may be of independent interest.Our experiments help us identify one factor that causes input-gradient alignment in models, that being the approximate generative modelling behaviour of the normalized logit distributions.

该研究分析了模型 input-gradients 在解释性方面的问题，提出了将标准 softmax-based 分类器的 logits 重新解释为未归一化的数据密度，证明了 input-gradients 可以被视为隐含于判别模型中的类条件密度模型的梯度，并提出了通过 score-matching 来实现对隐含密度模型与数据分布的对齐的算法。研究表明，对齐隐含密度模型和数据分布可以提高梯度的结构性和解释性。

重新思考基于梯度的属性方法在模型可解释性中的作用