Referring video object segmentation (RVOS), as a supervised learning task, relies on sufficient annotated data for a given scene. In more realistic scenarios, however, only minimal annotations are available for a new scene, which poses significant challenges to existing RVOS methods. With this in mind, we propose a simple yet effective model with a newly designed cross-modal affinity (CMA) module based on a Transformer architecture. The CMA module builds multimodal affinity from a few samples, allowing the model to learn new semantic information quickly and adapt to different scenarios. Since the proposed method targets limited samples for new scenes, we formulate the problem as few-shot referring video object segmentation (FS-RVOS). To foster research in this direction, we build a new FS-RVOS benchmark from currently available datasets. The benchmark covers a wide range of situations so as to simulate real-world scenarios as closely as possible. Extensive experiments show that our model adapts well to different scenarios with only a few samples, reaching state-of-the-art performance on the benchmark. On Mini-Ref-YouTube-VOS, our model achieves an average performance of 53.1 J and 54.8 F, which are 10% better than the baselines. Furthermore, we show impressive results of 77.7 J and 74.8 F on Mini-Ref-SAIL-VOS, which are significantly better than the baselines. Code is publicly available at https://github.com/hengliusky/Few_shot_RVOS.
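The J scores quoted above are most likely region similarity in the usual video object segmentation sense, that is, the Jaccard index (intersection over union) between predicted and ground-truth masks, averaged over frames and reported as a percentage. The abstract does not define the metric, so this reading is an assumption. The sketch below illustrates that computation on binary masks given as nested lists; the function names are hypothetical and are not taken from the authors' repository, and the boundary measure F is not covered.

```python
def region_similarity(pred_mask, gt_mask):
    """Jaccard index (IoU) between two binary masks of equal shape.

    Masks are nested lists of 0/1 (or bool) values. Hypothetical helper,
    not part of the released code.
    """
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred_mask, gt_mask):
        for p, g in zip(pred_row, gt_row):
            p, g = bool(p), bool(g)
            inter += int(p and g)   # pixel is foreground in both masks
            union += int(p or g)    # pixel is foreground in at least one mask
    if union == 0:
        # Both masks are empty; treating this as perfect agreement is an
        # assumed convention, not something stated in the abstract.
        return 1.0
    return inter / union


def mean_region_similarity(pred_masks, gt_masks):
    """Average the per-frame Jaccard index over a video clip."""
    scores = [region_similarity(p, g) for p, g in zip(pred_masks, gt_masks)]
    return sum(scores) / len(scores)
```

Multiplying the clip-level mean by 100 gives a value on the same scale as the figures quoted in the abstract, under the assumption stated above.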

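The paper proposes a simple yet effective Transformer-based model whose newly designed cross-modal affinity (CMA) module builds multimodal affinity from very few samples, so that the model learns new semantic information quickly and adapts to different scenarios. This offers a solution to the few-shot referring video object segmentation (FS-RVOS) problem. Extensive experiments on the newly built FS-RVOS benchmark show that the model adapts well to different scenarios with only a few samples and reaches state-of-the-art performance on the benchmark.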

Learning Cross-Modal Affinity for Referring Video Object Segmentation Targeting Limited Samples