Video instance segmentation requires classifying, segmenting, and tracking every object across video frames. Unlike existing approaches that rely on masks, boxes, or category labels, we propose UVIS, a novel Unsupervised Video Instance Segmentation framework that performs video instance segmentation without any video annotations or dense label-based pretraining. Our key insight is to leverage the dense shape prior of the self-supervised vision foundation model DINO and the open-set recognition ability of the image-caption-supervised vision-language model CLIP. The UVIS framework consists of three essential steps: frame-level pseudo-label generation, transformer-based VIS model training, and query-based tracking. To improve the quality of VIS predictions in the unsupervised setup, we introduce a dual-memory design: a semantic memory bank for generating accurate pseudo-labels and a tracking memory bank for maintaining temporal consistency in object tracks. We evaluate our approach on three standard VIS benchmarks, namely YoutubeVIS-2019, YoutubeVIS-2021, and Occluded VIS. UVIS achieves 21.1 AP on YoutubeVIS-2019 without any video annotations or dense pretraining, demonstrating the potential of our unsupervised VIS framework.
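
To make the idea of query-based tracking with a tracking memory bank more concrete, the following is a minimal sketch and not the UVIS implementation. It assumes each frame yields one embedding vector per predicted instance, that tracks are matched greedily by cosine similarity, and that stored embeddings are refreshed with a momentum update. The class name `TrackingMemory`, the threshold `match_threshold`, and the `momentum` value are all hypothetical choices made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


class TrackingMemory:
    """Hypothetical tracking memory: one stored embedding per track id."""

    def __init__(self, match_threshold=0.5, momentum=0.8):
        self.match_threshold = match_threshold  # assumed value, not from the paper
        self.momentum = momentum                # assumed value, not from the paper
        self.bank = {}                          # track id -> embedding
        self.next_id = 0

    def update(self, queries):
        """Return one track id per query embedding of the current frame."""
        # Score every (query, track) pair and visit the best pairs first.
        pairs = sorted(
            ((cosine(q, emb), qi, tid)
             for qi, q in enumerate(queries)
             for tid, emb in self.bank.items()),
            reverse=True,
        )
        assigned, used = {}, set()
        for score, qi, tid in pairs:
            if score < self.match_threshold:
                break  # all remaining pairs are too dissimilar
            if qi in assigned or tid in used:
                continue  # greedy one-to-one matching
            assigned[qi] = tid
            used.add(tid)

        ids = []
        for qi, q in enumerate(queries):
            if qi in assigned:
                # Matched: blend the new embedding into the stored one.
                tid = assigned[qi]
                old = self.bank[tid]
                self.bank[tid] = [
                    self.momentum * o + (1.0 - self.momentum) * n
                    for o, n in zip(old, q)
                ]
            else:
                # Unmatched: start a new track.
                tid = self.next_id
                self.next_id += 1
                self.bank[tid] = list(q)
            ids.append(tid)
        return ids


memory = TrackingMemory()
first = memory.update([[1.0, 0.0], [0.0, 1.0]])   # two new tracks: [0, 1]
second = memory.update([[0.1, 0.9], [0.9, 0.1]])  # same objects, swapped order: [1, 0]
```

In this sketch, identities survive a change in query order because matching depends on embedding similarity rather than query index. A real system might replace the greedy loop with optimal bipartite matching; that is a design choice not specified in the text above.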

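UVIS is an unsupervised video instance segmentation framework that leverages the dense shape prior of DINO and the open-set recognition ability of CLIP. It proceeds in three key steps: frame-level pseudo-label generation, transformer-based VIS model training, and query-based tracking. A dual-memory design, comprising a semantic memory bank and a tracking memory bank, improves the quality of VIS predictions in the unsupervised setting. UVIS reaches 21.1 AP on YoutubeVIS-2019, demonstrating the potential of this unsupervised VIS framework.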

UVIS: Unsupervised Video Instance Segmentation