To improve trust and transparency, it is crucial to be able to interpret the decisions of deep neural network (DNN) classifiers. Instance-level examinations, such as attribution techniques, are commonly employed to interpret model decisions. However, interpreting misclassified decisions may require human intervention: analyzing the attributions across each class within one instance can be particularly labor-intensive and is influenced by the bias of the human interpreter. In this paper, we present a novel framework that uncovers the weaknesses of a classifier via counterfactual examples. A prober is introduced to learn the correctness of the classifier's decisions as a binary code: hit or miss. The prober enables the creation of counterfactual examples with respect to its own decision. We test the prober's misclassification detection performance and verify its effectiveness on image classification benchmark datasets. Furthermore, on the MNIST dataset, by generating counterfactuals that penetrate the prober, we demonstrate that our framework effectively identifies vulnerabilities in the target classifier without relying on label information.
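As a rough illustration of the hit-or-miss formulation described above, the following sketch builds binary targets from a classifier's predictions and fits a small logistic-regression prober on them. It is a minimal sketch under stated assumptions, not the architecture used in this work: the function names (`hit_miss_targets`, `train_prober`, `hit_probability`), the choice of logistic regression, and the use of a single confidence-like feature as the prober's input are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def hit_miss_targets(predictions: Sequence[int], labels: Sequence[int]) -> List[int]:
    """Return 1 (hit) where the classifier was right and 0 (miss) otherwise."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    return [1 if p == y else 0 for p, y in zip(predictions, labels)]


def _sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def hit_probability(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Prober's estimated probability that the classifier's decision is a hit."""
    return _sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_prober(
    features: Sequence[Sequence[float]],
    targets: Sequence[int],
    lr: float = 0.5,
    epochs: int = 500,
) -> Tuple[List[float], float]:
    """Fit a logistic-regression prober (assumed stand-in) by batch gradient descent."""
    dim = len(features[0])
    n = len(features)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, t in zip(features, targets):
            err = hit_probability(w, b, x) - t  # gradient of binary cross-entropy
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Toy data: classifier predictions, ground-truth labels (needed only to train
# the prober), and one hypothetical feature per instance.
preds = [3, 1, 4, 1]
labels = [3, 7, 4, 2]
feats = [[0.9], [0.4], [0.95], [0.5]]

targets = hit_miss_targets(preds, labels)
w, b = train_prober(feats, targets)
```

Once fitted, such a prober can score new instances with `hit_probability(w, b, x)` without access to their labels, which is the property the counterfactual generation step relies on.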

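This work addresses the insufficient interpretability of deep neural network (DNN) classifier decisions. We propose a novel framework that reveals a classifier's weaknesses through counterfactual examples, and we introduce a prober that verifies the correctness of the classifier's decisions. Experiments show that the framework effectively identifies vulnerabilities in the target classifier without relying on label information.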

Probing Network Decisions: Capturing Uncertainties and Unveiling Vulnerabilities without Label Information