Self-interpreting neural networks have garnered significant research interest. Existing works in this domain often (1) lack a solid theoretical foundation ensuring genuine interpretability or (2) compromise model expressiveness. In response, we formulate a generic Additive Self-Attribution (ASA) framework. Observing that Shapley values are absent from Additive Self-Attribution, we propose the Shapley Additive Self-Attributing Neural Network (SASANet), with theoretical guarantees that its self-attribution values equal the Shapley values of its output. Specifically, SASANet uses a marginal contribution-based sequential schema and internal distillation-based training strategies to model meaningful outputs for any number of features, resulting in an unapproximated, meaningful value function. Our experimental results indicate that SASANet surpasses existing self-attributing models in performance and rivals black-box models. Moreover, SASANet is shown to be more precise and efficient than post-hoc methods in interpreting its own predictions.
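To make the notion of Shapley values as averaged marginal contributions concrete, the following is a minimal sketch of the textbook Shapley definition on a toy value function. It is not SASANet's architecture or training procedure: the names `shapley_values` and `toy_value` are hypothetical, and the value function is hand-written here, whereas SASANet learns one that is defined for any subset of features.

```python
from itertools import permutations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values: average each feature's marginal
    contribution over every ordering in which features can be added."""
    phi = [0.0] * n_features
    for order in permutations(range(n_features)):
        present = frozenset()
        prev = value_fn(present)
        for i in order:
            present = present | {i}
            cur = value_fn(present)
            phi[i] += cur - prev  # marginal contribution of feature i
            prev = cur
    n_orders = factorial(n_features)
    return [p / n_orders for p in phi]


def toy_value(subset):
    """Hypothetical value function over three features (assumption:
    chosen only for illustration). Features 0 and 1 interact;
    feature 2 never changes the output."""
    v = 0.0
    if 0 in subset:
        v += 1.0
    if 1 in subset:
        v += 2.0
    if 0 in subset and 1 in subset:
        v += 1.0  # interaction term, split evenly by the Shapley value
    return v


phi = shapley_values(toy_value, 3)
full = toy_value(frozenset({0, 1, 2}))
empty = toy_value(frozenset())
print(phi, sum(phi), full - empty)
```

In this toy case the attributions sum to the difference between the output on all features and the output on none, which is the additive property that a self-attributing model with Shapley guarantees is meant to satisfy by construction. The exhaustive enumeration is factorial in the number of features, so it is only practical for very small examples.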

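Summary: By introducing Shapley values, the authors propose a general additive self-attributing neural network model (SASANet) that guarantees self-interpretability. SASANet uses a marginal contribution-based sequential architecture and an internal distillation training strategy to model meaningful outputs for any number of features as an exact, meaningful value function. Experimental results show that SASANet outperforms existing self-attributing models and rivals black-box models, while being more accurate and efficient at interpreting its own predictions.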

Title: Towards Faithful Neural Network Intrinsic Interpretation with Shapley Additive Self-Attribution