Open-vocabulary image semantic segmentation (OVS) seeks to segment images into semantic regions across an open set of categories. Existing OVS methods commonly depend on foundational vision-language models and utilize similarity computation to tackle OVS tasks. However, these approaches are predominantly tailored to natural images and struggle with the unique characteristics of remote sensing images, such as rapidly changing orientations and significant scale variations. These challenges complicate OVS tasks in earth vision, requiring specialized approaches. To tackle this dilemma, we propose the first OVS framework specifically designed for remote sensing imagery, drawing inspiration from the distinct remote sensing traits. Particularly, to address the varying orientations, we introduce a rotation-aggregative similarity computation module that generates orientation-adaptive similarity maps as initial semantic maps. These maps are subsequently refined at both spatial and categorical levels to produce more accurate semantic maps. Additionally, to manage significant scale changes, we integrate multi-scale image features into the upsampling process, resulting in the final scale-aware semantic masks. To advance OVS in earth vision and encourage reproducible research, we establish the first open-sourced OVS benchmark for remote sensing imagery, including four public remote sensing datasets. Extensive experiments on this benchmark demonstrate our proposed method achieves state-of-the-art performance. All codes and datasets are available at https://github.com/caoql98/OVRS.

本研究针对开放词汇图像语义分割（OVS）在遥感图像中的应用，解决了现有方法在处理遥感图像特有特征时的不足。我们提出了首个专为遥感图像设计的OVS框架，引入了旋转聚合相似度计算模块和多尺度图像特征整合，显著提升了语义分割的准确性，并建立了首个开放的遥感图像OVS基准，为研究提供了重要工具和数据支持。

开放词汇遥感图像语义分割