We address the task of weakly-supervised few-shot image classification and segmentation by leveraging a Vision Transformer (ViT) pretrained with self-supervision. Our proposed method takes token representations from the self-supervised ViT and leverages their correlations, via self-attention, to produce classification and segmentation predictions through separate task heads. Our model effectively learns to perform classification and segmentation in the absence of pixel-level labels during training, using only image-level labels. To do this, it uses attention maps, created from tokens generated by the self-supervised ViT backbone, as pixel-level pseudo-labels. We also explore a practical setup with "mixed" supervision, where a small number of training images contain ground-truth pixel-level labels and the remaining images have only image-level labels. For this mixed setup, we propose to improve the pseudo-labels using a pseudo-label enhancer trained on the available ground-truth pixel-level labels. Experiments on Pascal-5i and COCO-20i demonstrate significant performance gains in a variety of supervision settings, particularly when few or no pixel-level labels are available.
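
To make the pseudo-labeling idea concrete, here is a minimal NumPy sketch of turning class-token attention over patch tokens into a binary pseudo-mask. It is an illustration under stated assumptions, not the authors' implementation: the function name `attention_pseudo_mask`, the single-head attention without learned projections, the min-max normalization, the 0.5 threshold, and the 14x14 grid with 384-dimensional tokens are all hypothetical choices, and random arrays stand in for real ViT outputs.

```python
import numpy as np


def attention_pseudo_mask(cls_token, patch_tokens, grid_hw, threshold=0.5):
    """Hypothetical helper: class-token attention over patches -> binary pseudo-mask.

    cls_token:    (d,) vector standing in for the ViT class token.
    patch_tokens: (h * w, d) array standing in for the ViT patch tokens.
    grid_hw:      (h, w) layout of the patch grid.
    threshold:    assumed cut-off applied after min-max normalization.
    """
    h, w = grid_hw
    d = patch_tokens.shape[-1]

    # Scaled dot-product score of the class token against every patch token.
    scores = patch_tokens @ cls_token / np.sqrt(d)

    # Softmax over patches (shifted by the max for numerical stability).
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    attn = weights.reshape(h, w)

    # Min-max normalize to [0, 1] so the threshold does not depend on scale.
    span = attn.max() - attn.min()
    attn = (attn - attn.min()) / (span + 1e-8)

    # Patches at or above the threshold become foreground in the pseudo-mask.
    mask = (attn >= threshold).astype(np.uint8)
    return attn, mask


# Random tokens stand in for the output of a self-supervised ViT backbone.
rng = np.random.default_rng(0)
patch_tokens = rng.normal(size=(14 * 14, 384))
cls_token = rng.normal(size=384)
attn, mask = attention_pseudo_mask(cls_token, patch_tokens, (14, 14))
```

In a real pipeline the mask would be upsampled to image resolution and used as a training target for the segmentation head; the abstract does not specify those details, so they are left out here.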

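We propose a method that uses a Vision Transformer (ViT) pretrained with self-supervision to address weakly-supervised few-shot image classification and segmentation. The method exploits the token representations of the self-supervised ViT through self-attention and predicts classification and segmentation results with separate task heads. Experiments show that our model learns classification and segmentation effectively without pixel-level labels, using only image-level labels, and achieves significant performance gains when few or no pixel-level labels are available.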

Distilling Self-Supervised Vision Transformers for Weakly-Supervised Few-Shot Classification and Segmentation