Research on Robust Interpretability of Self-Explaining Neural Networks

Jun, 2018

Towards Robust Interpretability with Self-Explaining Neural Networks

David Alvarez-Melis, Tommi S. Jaakkola

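TL;DR: The paper proposes three desiderata for self-explaining models: explicitness, faithfulness, and stability. The goal is to build interpretability into the model itself, so that complex models can still be explained. Faithfulness and stability are enforced through regularization tailored to the model. Experimental results suggest that this framework is a promising direction for resolving the tension between model complexity and interpretability.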

Abstract

Most recent work on interpretability of complex machine learning models has focused on estimating $\textit{a posteriori}$ explanations for previously trained models around specific predictions. $\textit{Self-explaining}$ models where →