The prominence of generalized foundation models in vision-language integration has witnessed a surge, given their multifarious applications. Within the natural domain, the procurement of vision-language datasets to construct these foundation models is facilitated by their abundant availability and the ease of web crawling. Conversely, in the remote sensing domain, although vision-language datasets exist, their volume is suboptimal for constructing robust foundation models. This study introduces an approach to curate vision-language datasets by employing an image decoding machine learning model, negating the need for human-annotated labels. Utilizing this methodology, we amassed approximately 9.6 million vision-language paired datasets in VHR imagery. The resultant model outperformed counterparts that did not leverage publicly available vision-language datasets, particularly in downstream tasks such as zero-shot classification, semantic localization, and image-text retrieval. Moreover, in tasks exclusively employing vision encoders, such as linear probing and k-NN classification, our model demonstrated superior efficacy compared to those relying on domain-specific vision-language datasets.

本研究解决了遥感领域视觉-语言数据集不足的问题。通过引入图像解码机器学习模型，研究者能够无需人工标注收集约960万对视觉-语言数据集。结果表明，该模型在零样本分类、语义定位和图像-文本检索等下游任务中优于未使用公开数据集的对手，展示了显著的效能提升。

在没有人工标注的情况下推动视觉-语言模型在遥感中的极限