Recent years have witnessed the emergence of a variety of post-hoc interpretations that aim to uncover how natural language processing (NLP) models make predictions. Despite the surge of new interpretations, it remains an open problem how to define and quantitatively measure the faithfulness of interpretations, i.e., to what extent they conform to the reasoning process behind the model. To tackle these issues, we start with three criteria: the removal-based criterion, the sensitivity of interpretations, and the stability of interpretations, that quantify different notions of faithfulness, and propose novel paradigms to systematically evaluate interpretations in NLP. Our results show that the performance of interpretations under different criteria of faithfulness could vary substantially. Motivated by the desideratum of these faithfulness notions, we introduce a new class of interpretation methods that adopt techniques from the adversarial robustness domain. Empirical results show that our proposed methods achieve top performance under all three criteria. Along with experiments and analysis on both the text classification and the dependency parsing tasks, we come to a more comprehensive understanding of the diverse set of interpretations.

该研究提出两个衡量NLP模型判断过程解释准确性的新标准，即灵敏度和稳定性，并引入了一种新的基于对抗性鲁棒性的解释方法，证明了其在相应标准下的优越性。还应用该方法和度量标准在依赖分析中。

自然语言处理中的模型解释的敏感性和稳定性