We investigate whether three types of post hoc model explanations (feature attribution, concept activation, and training point ranking) are effective for detecting a model's reliance on spurious signals in the training data. Specifically, we consider the scenario where the spurious signal to be detected is unknown, at test time, to the user of the explanation method. We design an empirical methodology that uses semi-synthetic datasets along with pre-specified spurious artifacts to obtain models that verifiably rely on these spurious training signals. We then provide a suite of metrics that assess an explanation method's reliability for spurious signal detection under various conditions. We find that the post hoc explanation methods tested are ineffective when the spurious artifact is unknown at test time, especially for non-visible artifacts like a background blur. Further, we find that feature attribution methods are susceptible to erroneously indicating dependence on spurious signals even when the model being explained does not rely on spurious artifacts. These findings cast doubt on the utility of these approaches, in the hands of a practitioner, for detecting a model's reliance on spurious signals.
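
As a rough illustration of the semi-synthetic setup described above, the sketch below stamps a pre-specified artifact onto every image of one class and computes a simple localization score for an attribution map. This is a minimal sketch, not the paper's pipeline: the function names (`stamp_artifact`, `attribution_mass_in_patch`), the corner-patch artifact, and the score itself are all hypothetical choices made for this example.

```python
import numpy as np

def stamp_artifact(images, labels, spurious_class, patch=4, value=1.0):
    """Return a copy of `images` (shape N x H x W) with a bright square in the
    top-left corner of every image labeled `spurious_class`, plus the mask of
    affected images. The corner patch is a hypothetical stand-in artifact."""
    out = images.copy()
    mask = labels == spurious_class
    out[mask, :patch, :patch] = value
    return out, mask

def attribution_mass_in_patch(attribution, patch=4):
    """Fraction of absolute attribution (shape H x W) inside the patch region.
    An illustrative score, not one of the paper's metrics."""
    total = np.abs(attribution).sum()
    if total == 0:
        return 0.0
    return float(np.abs(attribution[:patch, :patch]).sum() / total)

# Toy data: 8 random 16x16 "images" with binary labels.
rng = np.random.default_rng(0)
images = rng.uniform(0.0, 0.5, size=(8, 16, 16))
labels = np.array([0, 1, 0, 1, 1, 0, 0, 1])

# Class 1 now co-occurs perfectly with the corner patch.
stamped, mask = stamp_artifact(images, labels, spurious_class=1)
```

A score like this requires knowing where the artifact is, which is exactly the information the abstract assumes the practitioner lacks at test time. It is therefore only useful as an evaluation tool when the artifact was planted by the experimenter.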

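Using semi-synthetic datasets and pre-specified spurious artifacts, we design an empirical methodology and provide a set of metrics for assessing how reliably an explanation method detects spurious signals under various conditions. We find that the post hoc explanation methods tested are ineffective when the spurious artifact is unknown to the user of the explanation method at test time, especially for non-visible artifacts such as background blur. We also find that feature attribution methods are prone to erroneously indicating dependence on spurious signals even when the model being explained does not rely on spurious artifacts. These findings cast doubt on the utility of these methods for detecting a model's reliance on spurious signals.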

Post hoc explanations may be ineffective for detecting unknown spurious correlations