Image-text pretraining on web-scale image-caption datasets has become the
default recipe for open-vocabulary classification and retrieval models thanks
to the success of CLIP and its variants. Several works have also used CLIP
features for dense prediction tasks and have shown the emergence of open-set
abilities. However, the contrastive objective only focuses on image-text
alignment and does not incentivise image feature learning for dense prediction
tasks. In this work, we propose SILC, which simply adds local-to-global
correspondence learning by self-distillation as an additional objective for
contrastive pretraining. We show that distilling local image
features from an exponential moving average (EMA) teacher model significantly
improves model performance on several computer vision tasks including
classification, retrieval, and especially segmentation. We further show that
SILC scales better with the same training duration compared to the baselines.
Our model SILC sets a new state of the art for zero-shot classification,
few-shot classification, image and text retrieval, zero-shot segmentation, and
open-vocabulary segmentation.
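
The two ingredients named in the abstract, an EMA teacher and local-to-global self-distillation, can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not the SILC implementation: the abstract does not specify the loss, so a temperature-sharpened cross-entropy in the style of common self-distillation methods is assumed, and the names `ema_update`, `distillation_loss`, and all default values are hypothetical.

```python
import math

def ema_update(teacher, student, momentum=0.99):
    """Move each teacher weight toward the student: t <- m * t + (1 - m) * s."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]

def log_softmax(logits, temperature):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_total for x in scaled]

def distillation_loss(teacher_global_logits, student_local_logits,
                      teacher_temp=0.04, student_temp=0.1):
    """Cross-entropy from the teacher's global-view distribution to the
    student's local-crop distribution (assumed loss form, see lead-in)."""
    target = [math.exp(v) for v in log_softmax(teacher_global_logits, teacher_temp)]
    log_pred = log_softmax(student_local_logits, student_temp)
    return -sum(p * lq for p, lq in zip(target, log_pred))

# Usage: the student sees a local crop, the teacher sees the global view.
loss = distillation_loss([2.0, 0.0, 0.0], [1.5, 0.2, 0.1])
new_teacher = ema_update([1.0, 0.0], [0.0, 1.0], momentum=0.9)
```

In a full training loop this term would be added to the image-text contrastive loss, and only the student would receive gradients. The teacher is updated solely through `ema_update`.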

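Building on CLIP, this work proposes SILC, which adds local-to-global correspondence learning to pretraining. The method improves performance on computer vision tasks such as classification, retrieval, and segmentation, and sets a new state of the art for zero-shot classification, few-shot classification, image and text retrieval, zero-shot segmentation, and open-vocabulary segmentation.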

SILC: Improving Vision Language Pretraining with Self-Distillation

We present a new open-vocabulary detection approach based on
detection-oriented image-text pretraining to bridge the gap between image-level
pretraining and open-vocabulary object detection. At the pretraining phase, we
replace the commonly used classification architecture with the detector
architecture, which better serves the region-level recognition needs of
detection by enabling the detector heads to learn from noisy image-text pairs.
Using only standard contrastive loss and no pseudo-labeling, our approach is a
simple yet effective extension of the contrastive learning method to learn
emergent object-semantic cues. In addition, we propose a shifted-window
learning approach built on window attention to make the backbone representation
more robust, translation-invariant, and less biased by the window pattern. On
the popular LVIS open-vocabulary detection benchmark, our approach sets a new
state of the art of 40.4 mask AP$_r$ using the common ViT-L backbone,
significantly outperforming the best existing approach by +6.5 mask AP$_r$ at
system level. On the COCO benchmark, we achieve a very competitive 40.8 novel AP
without pseudo-labeling or weak supervision. In addition, we evaluate our
approach on the transfer detection setup, where ours outperforms the baseline
significantly. Visualization reveals emerging object locality from the
pretraining recipes compared to the baseline. Code and models will be publicly
released.
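
The "standard contrastive loss" mentioned above is commonly implemented as a symmetric cross-entropy over a batch of image-text similarities. The sketch below assumes that form. How the image embedding is produced (here it would come from the detector heads) is outside the sketch, and `contrastive_loss` and the temperature of 0.07 are illustrative choices rather than the paper's settings.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cross_entropy_row(logits, target_index):
    """Cross-entropy of one row of logits against a single correct index."""
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_total - logits[target_index]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-to-text and text-to-image loss; pair k matches pair k."""
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    # Cosine similarity matrix scaled by the temperature.
    sim = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
           for img in images]
    n = len(sim)
    image_to_text = sum(cross_entropy_row(sim[k], k) for k in range(n)) / n
    text_to_image = sum(
        cross_entropy_row([sim[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return 0.5 * (image_to_text + text_to_image)

# Usage: two image embeddings and their two matching caption embeddings.
loss = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.2, 0.8]])
```

Because the loss only needs paired embeddings, no region labels or pseudo-labels enter the computation, which matches the abstract's claim of learning from noisy image-text pairs alone.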

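A new open-vocabulary detection method based on detection-oriented image-text pretraining bridges the gap between image-level pretraining and open-vocabulary object detection. By letting the detector heads learn from noisy image-text pairs, the method uses a contrastive loss to learn emergent object-semantic cues. It achieves highly competitive results on the LVIS and COCO benchmarks and significantly outperforms the baseline in the transfer detection setup.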

Detection-Oriented Image-Text Pretraining for Open-Vocabulary Detection

We present Contrastive Feature Masking Vision Transformer (CFM-ViT), an
image-text pretraining methodology that achieves simultaneous learning of
image- and region-level representation for open-vocabulary object detection
(OVD). Our approach combines the masked autoencoder (MAE) objective with the
contrastive learning objective to improve the representation for localization
tasks. Unlike standard MAE, which reconstructs in pixel space, we perform
reconstruction in the joint image-text embedding space, which helps the model
learn region-level semantics better.
Moreover, we introduce Positional Embedding Dropout (PED) to address scale
variation between image-text pretraining and detection finetuning by randomly
dropping out the positional embeddings during pretraining. PED improves
detection performance and enables the use of a frozen ViT backbone as a region
classifier, preventing the forgetting of open-vocabulary knowledge during
detection finetuning. On the LVIS open-vocabulary detection benchmark, CFM-ViT
achieves a state-of-the-art 33.9 AP$_r$, surpassing the best approach by 7.6
points, and achieves better zero-shot detection transfer. Finally, CFM-ViT
acquires strong image-level representation, outperforming the state of the art
on 8 out of 12 metrics on zero-shot image-text retrieval benchmarks.
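
Positional Embedding Dropout, as described above, randomly drops the positional embeddings during pretraining. A minimal sketch of one way to do this follows. It assumes the whole set of positional embeddings is dropped at once for a forward pass, which the abstract does not confirm, and the function name and `drop_prob` parameter are hypothetical.

```python
import random

def add_positional_embeddings(patch_tokens, pos_embeddings, drop_prob, rng,
                              training=True):
    """Add positional embeddings to patch tokens, except that during training
    they are skipped entirely with probability drop_prob (assumed granularity)."""
    if training and rng.random() < drop_prob:
        # Dropped: tokens pass through without any positional signal.
        return [list(token) for token in patch_tokens]
    return [[t + p for t, p in zip(token, pos)]
            for token, pos in zip(patch_tokens, pos_embeddings)]

# Usage: two patch tokens of width 2, with a seeded generator for repeatability.
rng = random.Random(0)
tokens = [[1.0, 2.0], [3.0, 4.0]]
positions = [[0.1, 0.1], [0.2, 0.2]]
output = add_positional_embeddings(tokens, positions, drop_prob=0.5, rng=rng)
```

At inference time (`training=False`) the embeddings are always added, so the dropout only affects how strongly the pretrained backbone comes to rely on them.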

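CFM-ViT is an image-text pretraining method that learns image-level and region-level representations simultaneously for open-vocabulary object detection. By combining the masked autoencoder (MAE) objective with the contrastive learning objective, CFM-ViT performs reconstruction in the joint image-text embedding space and learns region-level semantics better than the classical MAE method. It also introduces Positional Embedding Dropout (PED) to address the scale variation between image-text pretraining and detection finetuning, which improves detection performance and allows a frozen ViT backbone to serve as a region classifier, preventing open-vocabulary knowledge from being forgotten during detection finetuning. On the LVIS open-vocabulary detection benchmark, CFM-ViT achieves a state-of-the-art 33.9 AP$_r$, surpassing the best approach by 7.6 points, and achieves better zero-shot detection transfer. Finally, CFM-ViT acquires a strong image-level representation, outperforming the state of the art on 8 out of 12 metrics on zero-shot image-text retrieval benchmarks.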