Image captioning and cross-modal retrieval are examples of tasks that involve the joint analysis of visual and linguistic information. For remote sensing imagery, these tasks can help non-expert users extract relevant Earth observation information for a variety of applications. Still, despite some previous efforts, the development and application of vision and language models in the remote sensing domain have been hindered by the relatively small size of the available datasets and of the models used in previous studies. In this work, we propose RS-CapRet, a vision and language method for remote sensing tasks, in particular image captioning and text-image retrieval. We specifically propose to use a highly capable large decoder language model together with image encoders adapted to remote sensing imagery through contrastive language-image pre-training. To bridge the image encoder and the language decoder, we propose training simple linear layers on examples drawn from a combination of different remote sensing image captioning datasets, keeping all other parameters frozen. RS-CapRet can then generate descriptions for remote sensing images and retrieve images from textual descriptions, achieving state-of-the-art performance or performance competitive with existing methods. Qualitative results illustrate that RS-CapRet can effectively leverage the pre-trained large language model to describe remote sensing images, retrieve them based on different types of queries, and process interleaved sequences of images and text in a dialogue manner.
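
The bridging idea described above, linear layers trained between a frozen image encoder and a frozen language decoder, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the RS-CapRet implementation: the dimensions, the function names (`init_linear`, `image_to_prefix`, `rank_images`), the choice of mapping one image feature to a short prefix of token-sized vectors, and the use of cosine similarity for ranking. Random values stand in for real encoder outputs.

```python
import math
import random


def init_linear(in_dim, out_dim, rng):
    """Create a small random weight matrix (list of rows) and a zero bias."""
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
              for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def linear(x, weight, bias):
    """Apply y = W x + b for a single input vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def image_to_prefix(image_feat, weight, bias, num_tokens, token_dim):
    """Project one image feature into num_tokens vectors of size token_dim.

    In a real system these vectors would be fed to the frozen language
    decoder as if they were word embeddings (hypothetical setup).
    """
    flat = linear(image_feat, weight, bias)
    return [flat[i * token_dim:(i + 1) * token_dim] for i in range(num_tokens)]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def rank_images(query_vec, image_vecs):
    """Return image indices sorted by decreasing similarity to the query."""
    scores = [cosine(query_vec, v) for v in image_vecs]
    return sorted(range(len(image_vecs)), key=lambda i: scores[i], reverse=True)


# Illustrative sizes only; real encoders and decoders are far larger.
IMG_DIM, TOKEN_DIM, NUM_TOKENS = 8, 4, 2
rng = random.Random(0)

# The only trainable parameters in this sketch are the linear layer's.
weight, bias = init_linear(IMG_DIM, NUM_TOKENS * TOKEN_DIM, rng)

# Stand-in for the output of a frozen image encoder.
image_feat = [rng.gauss(0.0, 1.0) for _ in range(IMG_DIM)]
prefix = image_to_prefix(image_feat, weight, bias, NUM_TOKENS, TOKEN_DIM)

# Stand-in retrieval: rank three image vectors against one query vector.
ranking = rank_images([1.0, 0.0], [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]])
```

Because only `weight` and `bias` would be updated during training in such a setup, the number of trainable parameters stays small compared with the frozen encoder and decoder, which is the appeal of this kind of design when captioning datasets are limited in size.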

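This work proposes RS-CapRet, a vision and language method for remote sensing tasks, mainly image captioning and text-image retrieval. We combine a highly capable large decoder language model with image encoders adapted to remote sensing imagery through contrastive language-image pre-training. RS-CapRet can generate descriptions for remote sensing images and retrieve images from textual descriptions, achieving performance comparable to existing methods. Qualitative results show that RS-CapRet can effectively leverage the pre-trained large language model to describe remote sensing images and can process interleaved sequences of images and text in a dialogue.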

Large Language Models for Captioning and Retrieving Remote Sensing Images