We study the problem of attributing the prediction of a deep network to its input features, a problem previously studied by several other works. We identify two fundamental axioms, Sensitivity and Implementation Invariance, that attribution methods ought to satisfy. We show that most known attribution methods do not satisfy them, which we consider to be a fundamental weakness of those methods. We use the axioms to guide the design of a new attribution method called Integrated Gradients. Our method requires no modification to the original network and is extremely simple to implement; it needs only a few calls to the standard gradient operator. We apply this method to a couple of image models, a couple of text models, and a chemistry model, demonstrating its ability to debug networks, to extract rules from a deep network, and to enable users to engage with models better.

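This paper studies how the input features of a deep network influence its predictions. It proposes two axioms, Sensitivity and Implementation Invariance, and points out that most known attribution methods do not satisfy them. The authors then design a new attribution method, Integrated Gradients, which requires no modification to the original network, and apply it to image, text, and chemistry models. The results show that the method can be used to debug networks and extract rules, and that it helps users engage with models better.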
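
The abstract describes Integrated Gradients as needing only a few calls to the standard gradient operator. As a minimal sketch of what that could look like, the Python below approximates the integral of gradients along a straight line from a baseline input to the actual input with a midpoint Riemann sum, then scales by the input difference. The toy logistic model, its weights, the all-zeros baseline, the step count, and every function name here are illustrative assumptions, not taken from the paper; a real network would obtain gradients from its framework's automatic differentiation rather than a hand-written formula.

```python
import math

# Hypothetical toy model (an assumption for illustration): logistic regression
# F(x) = sigmoid(w . x + b), chosen because its gradient is easy to write by hand.
W = [2.0, -1.0, 0.5]
B = 0.1


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def model(x):
    """Scalar prediction F(x) of the toy model."""
    return sigmoid(sum(w * xi for w, xi in zip(W, x)) + B)


def model_grad(x):
    """Gradient of F with respect to x: F(x) * (1 - F(x)) * w."""
    p = model(x)
    return [p * (1.0 - p) * w for w in W]


def integrated_gradients(f_grad, x, baseline, steps=50):
    """Approximate Integrated Gradients with a midpoint Riemann sum.

    f_grad   -- function returning the gradient of the model at a point
    x        -- input to explain
    baseline -- reference input (all zeros below, an assumed choice)
    steps    -- number of gradient calls along the straight-line path
    """
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval of [0, 1]
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = f_grad(point)
        total = [t + g for t, g in zip(total, grad)]
    # Average the gradients along the path, then scale by (x - baseline).
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


x = [1.0, 2.0, -1.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(model_grad, x, baseline)

print("attributions:", attributions)
print("sum of attributions:", sum(attributions))
print("F(x) - F(baseline): ", model(x) - model(baseline))
```

The last two printed values should agree up to numerical integration error, since the attributions along the path add up to the change in the prediction between baseline and input. Comparing them is a quick sanity check on whether the step count is large enough.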

Axiomatic Attribution for Deep Neural Networks