Recently, interpretable machine learning has re-explored concept bottleneck
models (CBMs), which first predict high-level concepts from the raw features
and then predict the target variable from the predicted concepts. A
compelling advantage of this model class is the user's ability to intervene on
the predicted concept values, affecting the model's downstream output. In this
work, we introduce a method to perform such concept-based interventions on
already-trained neural networks, which are not interpretable by design, given
an annotated validation set. Furthermore, we formalise the model's
intervenability as a measure of the effectiveness of concept-based
interventions and leverage this definition to fine-tune black-box models.
Empirically, we explore the intervenability of black-box classifiers on
synthetic tabular and natural image benchmarks. We demonstrate that fine-tuning
improves intervention effectiveness and often yields better-calibrated
predictions. To showcase the practical utility of the proposed techniques, we
apply them to deep chest X-ray classifiers and show that fine-tuned black boxes
can be as intervenable as CBMs while being more performant.
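
To make the two-stage structure and the notion of an intervention concrete, the following is a minimal sketch in plain Python of a toy CBM: concepts are predicted from raw features, the target is predicted from the concepts, and a user overwrites one predicted concept with its known value. All weights, inputs, and function names are hypothetical. The drop in target loss after the intervention is used only as an illustrative proxy for intervention effectiveness; it is not the formal definition of intervenability from the paper, and the sketch does not cover the proposed method for black-box networks.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_concepts(x, W, b):
    """First stage: raw features -> concept probabilities (one logistic unit per concept)."""
    return [
        sigmoid(sum(w * xj for w, xj in zip(row, x)) + bi)
        for row, bi in zip(W, b)
    ]


def predict_target(c, v, v0):
    """Second stage: concepts -> probability of the positive class."""
    return sigmoid(sum(vi * ci for vi, ci in zip(v, c)) + v0)


def intervene(c_hat, c_true, mask):
    """Overwrite the predicted concepts selected by `mask` with user-provided values."""
    return [t if m else p for p, t, m in zip(c_hat, c_true, mask)]


def bce(y, p):
    """Binary cross-entropy for a single example."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


# Hypothetical toy parameters (assumptions for illustration only).
W = [[2.0, 0.0], [0.0, 2.0]]  # feature -> concept weights
b = [0.0, 0.0]
v = [3.0, -3.0]               # concept -> target weights
v0 = 0.0

x = [-1.0, 0.5]       # raw features for which the first concept is mispredicted
c_true = [1.0, 0.0]   # ground-truth concept values known to the user
y = 1.0               # ground-truth target

c_hat = predict_concepts(x, W, b)
p_before = predict_target(c_hat, v, v0)

# The user corrects only the first concept; the second stays as predicted.
c_int = intervene(c_hat, c_true, mask=[True, False])
p_after = predict_target(c_int, v, v0)

# Illustrative proxy for intervention effectiveness: reduction in target loss.
loss_drop = bce(y, p_before) - bce(y, p_after)
print(f"p(y=1) before: {p_before:.3f}, after: {p_after:.3f}, loss drop: {loss_drop:.3f}")
```

In this toy setting, correcting a single concept moves the target prediction towards the true label, which is the behaviour that concept-based interventions are meant to provide.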

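This work introduces a method for performing concept-based interventions on already-trained, non-interpretable neural networks. It defines a model's intervenability as a measure of the effectiveness of concept-based interventions, and fine-tunes the model to improve intervention effectiveness and the calibration of its predictions. Experiments show that fine-tuned black-box models can match concept bottleneck models in intervention effectiveness while achieving higher performance.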

Beyond Concept Bottleneck Models: How to Make Black Boxes Intervenable?

Concept bottleneck models map from raw inputs to concepts, and then from
concepts to targets. Such models aim to incorporate pre-specified, high-level
concepts into the learning procedure, and have been motivated to meet three
desiderata: interpretability, predictability, and intervenability. However, we
find that concept bottleneck models struggle to meet these goals. Using post
hoc interpretability methods, we demonstrate that concepts do not correspond to
anything semantically meaningful in input space, thus calling into question the
usefulness of concept bottleneck models in their current form.

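The study finds that concept bottleneck models struggle to meet the three goals of interpretability, predictability, and intervenability. Using post hoc interpretability methods, it shows that the concepts do not correspond to anything semantically meaningful in input space, which calls into question the usefulness of concept bottleneck models in their current form.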