Interpretable computer vision models can produce transparent predictions, where the features of an image are compared with prototypes from a training dataset and the similarity between them forms the basis for classification. Nevertheless, these methods are computationally expensive to train, introduce additional complexity and may require domain knowledge to adapt hyper-parameters to a new dataset. Inspired by developments in object detection, segmentation and large-scale self-supervised foundation vision models, we introduce Component Features (ComFe), a novel explainable-by-design image classification approach using a transformer-decoder head and hierarchical mixture-modelling. With only global image labels and no segmentation or part annotations, ComFe can identify consistent image components, such as the head, body, wings and tail of a bird, and the image background, and determine which of these features are informative in making a prediction. We demonstrate that ComFe obtains higher accuracy than previous interpretable models across a range of fine-grained vision benchmarks, without the need to individually tune hyper-parameters for each dataset. We also show that ComFe outperforms a non-interpretable linear head across a range of datasets, including ImageNet, and improves performance on generalisation and robustness benchmarks.
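To make the prototype-comparison idea in the abstract concrete, the following is a minimal sketch of generic similarity-based classification. It is not ComFe itself: it omits the transformer-decoder head and the hierarchical mixture model, and the function names, the softmax temperature, the max-pooling over patches and the toy prototype-to-class weights are all assumptions chosen for illustration.

```python
import numpy as np


def cosine_similarity(features, prototypes):
    """Cosine similarity between each feature row and each prototype row."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return f @ p.T


def classify_with_prototypes(patch_features, prototypes, class_weights,
                             temperature=0.1):
    """Hypothetical prototype classifier (illustrative, not the paper's method).

    patch_features: (n_patches, dim) features for one image
    prototypes:     (n_prototypes, dim) prototype vectors
    class_weights:  (n_prototypes, n_classes); an all-zero row marks a
                    prototype as uninformative (e.g. background)
    """
    sims = cosine_similarity(patch_features, prototypes)
    # Soft assignment of every patch to the prototypes (softmax over prototypes).
    z = sims / temperature
    z = z - z.max(axis=1, keepdims=True)
    assign = np.exp(z)
    assign = assign / assign.sum(axis=1, keepdims=True)
    # A prototype counts as present if any patch is strongly assigned to it.
    presence = assign.max(axis=0)
    scores = presence @ class_weights
    return int(scores.argmax()), presence


# Toy example: prototypes 0 and 1 are class evidence, prototype 2 is background.
prototypes = np.eye(3)
class_weights = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
patches = np.array([[0.1, 0.9, 0.0], [0.0, 0.1, 0.9]])
predicted_class, presence = classify_with_prototypes(
    patches, prototypes, class_weights)
```

The returned `presence` vector is what makes such a prediction inspectable: it shows which prototypes were matched in the image, and the zero row in `class_weights` shows how a background prototype can be matched without influencing the class scores.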

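Using a transformer-decoder head and hierarchical mixture modelling, we introduce Component Features (ComFe), a novel interpretable image classification approach that, with only global image labels and no segmentation or part annotations, can identify consistent image components and determine which features are informative in making a prediction. We demonstrate that ComFe obtains higher accuracy than previous interpretable models across a range of fine-grained vision benchmarks, without the need to tune hyper-parameters individually for each dataset. We also show that ComFe outperforms a non-interpretable linear head across a range of datasets, including ImageNet, and improves performance on generalisation and robustness benchmarks.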

Scalable and powerful Transformer decoders for interpretable image classification with foundation models