A faithful and interpretable explanation of an AI model's behavior and internal structure is a high-level explanation that is human-intelligible yet consistent with the known, but often opaque, low-level causal details of the model. We argue that the theory of causal abstraction provides the mathematical foundations for explanations of this kind. In causal abstraction analysis, we use interventions on model-internal states to rigorously assess whether an interpretable high-level causal model is a faithful description of an AI model. Our contributions in this area are: (1) We generalize causal abstraction to cyclic causal structures and typed high-level variables. (2) We show how multi-source interchange interventions can be used to conduct causal abstraction analyses. (3) We define a notion of approximate causal abstraction that allows us to assess the degree to which a high-level causal model is a causal abstraction of a lower-level one. (4) We prove that constructive causal abstraction can be decomposed into three operations, which we refer to as marginalization, variable-merge, and value-merge. (5) We formalize the XAI methods of LIME, causal effect estimation, causal mediation analysis, iterated nullspace projection, and circuit-based explanations as special cases of causal abstraction analysis.

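This paper proposes the theory of causal abstraction as the mathematical foundation for high-level explanations of AI models. We use causal abstraction analysis to determine whether an interpretable high-level causal model faithfully reflects an AI model's behavior and internal structure. We also define a notion of approximate causal abstraction to measure the degree to which a high-level causal model abstracts the low-level model, and we formalize LIME, causal effect estimation, causal mediation analysis, iterated nullspace projection, and circuit-based explanations as special cases of causal abstraction analysis.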
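
As an informal illustration of an interchange intervention, the following minimal sketch may help. It is not taken from the paper: the two toy models, the variable names (`h1`, `h2`, `S`), and the helper `interchange_accuracy` are all hypothetical and chosen only to keep the example small. A hidden value computed on a source input is patched into a run on a base input, the same swap is applied to the aligned high-level variable, and the fraction of (base, source) pairs on which the two models agree serves as a rough measure of approximate causal abstraction under these assumptions.

```python
from itertools import product


def low_level(x, patch=None):
    """Hypothetical low-level model: hidden states h1, h2, then an output."""
    a, b, c = x
    hidden = {"h1": a + b, "h2": c}
    if patch:
        hidden.update(patch)  # intervene on a model-internal state
    return hidden, hidden["h1"] + hidden["h2"]


def high_level(x, patch=None):
    """Hypothetical high-level causal model: S = a + b, output = S + c."""
    a, b, c = x
    variables = {"S": a + b}
    if patch:
        variables.update(patch)  # intervene on the high-level variable
    return variables, variables["S"] + c


def interchange_accuracy(alignment, inputs):
    """Fraction of (base, source) pairs on which both models agree after
    the aligned variables are set to the values they take on the source."""
    high_var, low_var = alignment
    hits = total = 0
    for base, source in product(inputs, repeat=2):
        low_src, _ = low_level(source)
        high_src, _ = high_level(source)
        _, low_out = low_level(base, {low_var: low_src[low_var]})
        _, high_out = high_level(base, {high_var: high_src[high_var]})
        hits += low_out == high_out
        total += 1
    return hits / total


inputs = list(product(range(3), repeat=3))
good = interchange_accuracy(("S", "h1"), inputs)  # S aligned with h1
bad = interchange_accuracy(("S", "h2"), inputs)   # S misaligned with h2
print(good, bad)
```

In this toy setting, aligning `S` with `h1` yields perfect agreement, while aligning `S` with `h2` does not, which is the kind of contrast an interchange intervention analysis is meant to expose.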

Causal Abstraction for Faithful Model Interpretation