We present Contrastive Feature Masking Vision Transformer (CFM-ViT), an image-text pretraining methodology that simultaneously learns image- and region-level representations for open-vocabulary object detection (OVD). Our approach incorporates the masked autoencoder (MAE) objective into the contrastive learning objective to improve the representation for localization tasks. Unlike the classical MAE, which reconstructs in pixel space, we perform reconstruction in the joint image-text embedding space, which leads the model to better learn region-level semantics. Moreover, we introduce Positional Embedding Dropout (PED), which randomly drops out the positional embeddings during pretraining, to address the scale variation between image-text pretraining and detection finetuning. PED improves detection performance and enables the use of a frozen ViT backbone as a region classifier, preventing the forgetting of open-vocabulary knowledge during detection finetuning. On the LVIS open-vocabulary detection benchmark, CFM-ViT achieves a state-of-the-art 33.9 AP$_r$, surpassing the previous best approach by 7.6 points, and achieves better zero-shot detection transfer. Finally, CFM-ViT acquires strong image-level representation, outperforming the state of the art on 8 out of 12 metrics on zero-shot image-text retrieval benchmarks.
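
To make the PED idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the positional embeddings are dropped for the whole image at once with some probability, and only during pretraining. The function name `add_positional_embeddings`, the `drop_prob` parameter, and the example value 0.5 are hypothetical.

```python
import random


def add_positional_embeddings(patch_tokens, pos_embeddings, drop_prob, training, rng=random):
    """Add positional embeddings to patch tokens, with PED-style dropout.

    Assumption (not stated in the abstract): during training, the positional
    embeddings are skipped entirely with probability `drop_prob`; at inference
    time they are always added.
    """
    if training and rng.random() < drop_prob:
        # Dropout branch: return the tokens without any positional signal.
        return [list(token) for token in patch_tokens]
    # Regular branch: element-wise sum of each token and its positional embedding.
    return [
        [t + p for t, p in zip(token, pos)]
        for token, pos in zip(patch_tokens, pos_embeddings)
    ]


# Hypothetical toy input: two patch tokens of dimension 2.
tokens = [[1.0, 2.0], [3.0, 4.0]]
pos = [[0.5, 0.5], [1.0, 1.0]]
pretraining_out = add_positional_embeddings(tokens, pos, drop_prob=0.5, training=True)
inference_out = add_positional_embeddings(tokens, pos, drop_prob=0.5, training=False)
```

Under this reading, the backbone sees some training steps with no positional signal at all, which may make it less dependent on the low-resolution positional layout of pretraining when it is later applied to the larger inputs used in detection.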

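CFM-ViT is an image-text pretraining method that simultaneously learns image- and region-level representations for open-vocabulary object detection. By combining the masked autoencoder (MAE) objective with the contrastive learning objective, CFM-ViT performs reconstruction in the joint image-text embedding space and thereby learns region-level semantics better than the classical MAE method. In addition, Positional Embedding Dropout (PED) is introduced to address the scale variation between image-text pretraining and detection finetuning; it improves detection performance and allows a frozen ViT backbone to serve as a region classifier, avoiding the forgetting of open-vocabulary knowledge during detection finetuning. On the LVIS open-vocabulary detection benchmark, CFM-ViT achieves a state-of-the-art 33.9 AP$_r$, surpassing the previous best approach by 7.6 points, and attains better zero-shot detection transfer. Finally, CFM-ViT acquires strong image-level representation, outperforming the state of the art on 8 out of 12 metrics on zero-shot image-text retrieval benchmarks.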
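
One way to read the combination of the two objectives is as a weighted sum of a contrastive term and a feature-reconstruction term. The expression below is an illustrative sketch only: the weight $\lambda$, the masked-token set $M$, the symbols $\hat{f}_i$ (reconstructed feature) and $f_i$ (target feature in the joint image-text embedding space), and the choice of a cosine distance are all assumptions not stated in this summary.

$$
\mathcal{L} = \mathcal{L}_{\text{con}} + \lambda\, \mathcal{L}_{\text{rec}},
\qquad
\mathcal{L}_{\text{rec}} = \frac{1}{|M|} \sum_{i \in M} \bigl(1 - \cos(\hat{f}_i, f_i)\bigr)
$$

Under these assumptions, setting $\lambda = 0$ recovers plain contrastive pretraining, while a larger $\lambda$ places more emphasis on recovering masked features in the embedding space instead of in pixel space.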

Contrastive Feature Masking Open-Vocabulary Vision Transformer