Open-vocabulary dense prediction tasks, including object detection and image
segmentation, have been advanced by the success of Contrastive Language-Image
Pre-training (CLIP). CLIP models, particularly those incorporating vision
transformers (ViTs), have exhibited remarkable generalization ability in
zero-shot image classification. However, when transferring the vision-language
alignment of CLIP from global image representation to local region
representation for the open-vocabulary dense prediction tasks, CLIP ViTs suffer
from the domain shift from full images to local image regions. In this paper,
we present an in-depth analysis of the region-language alignment in CLIP
models, which is essential for downstream open-vocabulary dense prediction
tasks. Subsequently, we propose an approach named CLIPSelf, which adapts the
image-level recognition ability of CLIP ViT to local image regions without
needing any region-text pairs. CLIPSelf enables a CLIP ViT to distill itself by
aligning a region representation extracted from its dense feature map with the
image-level representation of the corresponding image crop. With the enhanced
CLIP ViTs, we achieve new state-of-the-art performance on open-vocabulary
object detection, semantic segmentation, and panoptic segmentation across
various benchmarks. Models and code will be available at
this https URL
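
The self-distillation objective described above can be illustrated with a minimal sketch. It is not the released implementation. It assumes, for illustration only, that a region representation is obtained by mean-pooling the cells of the dense feature map that fall inside a box, and that alignment is measured as one minus cosine similarity. The names mean_pool_region, cosine_similarity, and self_distillation_loss are hypothetical, and the feature values are toy numbers rather than CLIP outputs.

```python
import math

def mean_pool_region(feature_map, box):
    """Average the feature vectors of the grid cells inside `box`.

    feature_map: H x W grid (nested lists) of D-dimensional vectors.
    box: (row_start, col_start, row_end, col_end), end-exclusive, in grid cells.
    Mean pooling is an assumed stand-in for the region feature extractor.
    """
    r0, c0, r1, c1 = box
    cells = [feature_map[r][c] for r in range(r0, r1) for c in range(c0, c1)]
    dim = len(cells[0])
    return [sum(cell[d] for cell in cells) / len(cells) for d in range(dim)]

def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def self_distillation_loss(feature_map, boxes, crop_embeddings):
    """Mean of (1 - cosine similarity) over all regions.

    Student: region representation pooled from the dense feature map.
    Teacher: image-level embedding of the corresponding image crop.
    """
    losses = [
        1.0 - cosine_similarity(mean_pool_region(feature_map, box), teacher)
        for box, teacher in zip(boxes, crop_embeddings)
    ]
    return sum(losses) / len(losses)

# Toy 2 x 2 dense feature map with 2-dimensional features.
feature_map = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0]],
]
boxes = [(0, 0, 1, 2), (1, 0, 2, 2)]          # top row, bottom row
crop_embeddings = [[2.0, 0.0], [1.0, 0.0]]    # hypothetical teacher embeddings
loss = self_distillation_loss(feature_map, boxes, crop_embeddings)
```

In this toy case the first region is perfectly aligned with its crop embedding and the second is orthogonal to it, so the two per-region losses are 0 and 1. In training, the crop embeddings would come from the frozen image-level CLIP encoder applied to each crop, and only the dense features would receive gradients.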

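This paper presents an in-depth analysis of region-language alignment in CLIP models and proposes a method named CLIPSelf, which adapts the image-level recognition ability of CLIP ViTs to local image regions, achieving new state-of-the-art performance on open-vocabulary dense prediction tasks.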