We investigate whether post-hoc model explanations are effective for diagnosing model errors, a task we refer to as model debugging. In response to the challenge of explaining a model's prediction, a vast array of explanation methods has been proposed. Despite their increasing use, it is unclear whether they are effective. To start, we categorize \textit{bugs}, based on their source, into \textit{data}, \textit{model}, and \textit{test-time} contamination bugs. For several explanation methods, we assess their ability to: detect spurious correlation artifacts (data contamination), diagnose mislabeled training examples (data contamination), differentiate between a (partially) re-initialized model and a trained one (model contamination), and detect out-of-distribution inputs (test-time contamination). We find that the methods tested are able to diagnose a spurious background bug, but cannot conclusively identify mislabeled training examples. In addition, a class of methods that modify the back-propagation algorithm is invariant to the higher-layer parameters of a deep network and is hence ineffective for diagnosing model contamination. We complement our analysis with a human subject study, and find that subjects fail to identify defective models using attributions and instead rely primarily on model predictions. Taken together, our results provide guidance for practitioners and researchers turning to explanations as tools for model debugging.

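This work studies how effective post-hoc model explanations are for diagnosing model errors. Bugs are categorized by source into data, model, and test-time contamination bugs, and several explanation methods are evaluated on their ability to detect spurious correlations, identify mislabeled training examples, distinguish a (partially) re-initialized model from a trained one, and detect out-of-distribution inputs at test time. The methods tested can detect a spurious background bug, but cannot conclusively identify mislabeled training examples. Some methods are also invariant to the higher-layer parameters of a deep network, and therefore cannot effectively diagnose model contamination. A human subject study shows that participants fail to identify defective models using attributions and instead rely primarily on model predictions. These results provide guidance for researchers and practitioners who use explanations as tools for model debugging.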

Debugging Tests for Model Explanations