We introduce a method to train vision-language models for remote-sensing images without using any textual annotations. Our key insight is to use co-located internet imagery taken on the ground as an intermediary for connecting remote-sensing images and language. Specifically, we train an image encoder for remote-sensing images to align with the image encoder of CLIP using a large amount of paired internet and satellite images. Our unsupervised approach enables the training of a first-of-its-kind large-scale vision-language model (VLM) for remote-sensing images at two different resolutions. We show that these VLMs enable zero-shot, open-vocabulary image classification, retrieval, segmentation and visual question answering for satellite images. On each of these tasks, our VLM trained without textual annotations outperforms existing VLMs trained with supervision, with gains of up to 20% for classification and 80% for segmentation.
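
The alignment idea can be illustrated with a minimal sketch. The abstract does not state the training objective, so the InfoNCE-style contrastive loss below is an assumption, as are the names `alignment_loss` and `zero_shot_classify`, the temperature value, and the toy 2-D embeddings. The ground-image embeddings stand in for outputs of a frozen CLIP image encoder, and the text embeddings for outputs of CLIP's text encoder; all vectors are assumed to be non-zero.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def alignment_loss(sat_embs, ground_embs, temperature=0.07):
    """InfoNCE-style loss over a batch of (satellite, ground) pairs.

    sat_embs[i] is the satellite-encoder output for location i;
    ground_embs[i] is the (assumed frozen) CLIP embedding of a ground
    image taken at that location. The loss is low when each satellite
    embedding is closest to its own co-located ground embedding.
    """
    sat = [l2_normalize(v) for v in sat_embs]
    ground = [l2_normalize(v) for v in ground_embs]
    total = 0.0
    for i, s in enumerate(sat):
        logits = [dot(s, g) / temperature for g in ground]
        # Numerically stable log-sum-exp over all ground images in the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(sat)


def zero_shot_classify(sat_emb, text_embs):
    """Return the label whose text embedding has the highest cosine
    similarity to the satellite embedding. No text was needed to train
    the satellite encoder; it only shares CLIP's embedding space."""
    s = l2_normalize(sat_emb)
    scores = {label: dot(s, l2_normalize(t)) for label, t in text_embs.items()}
    return max(scores, key=scores.get)


# Toy usage with hand-made 2-D embeddings (hypothetical values).
ground = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
print(alignment_loss(aligned, ground), alignment_loss(swapped, ground))
print(zero_shot_classify([0.9, 0.1], {"forest": [1.0, 0.0], "water": [0.0, 1.0]}))
```

Because the satellite encoder is trained only against image embeddings, language enters solely through CLIP's existing text encoder at inference time, which is what makes the training free of textual annotations.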

Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment