In this paper, we introduce Context-Aware Video Instance Segmentation (CAVIS), a novel framework that enhances instance association by integrating the contextual information adjacent to each object. To extract and leverage this information efficiently, we propose the Context-Aware Instance Tracker (CAIT), which merges the context surrounding each instance with its core instance features to improve tracking accuracy. Additionally, we introduce the Prototypical Cross-frame Contrastive (PCC) loss, which enforces consistency of object-level features across frames and thereby significantly improves instance matching accuracy. CAVIS outperforms state-of-the-art methods on all benchmark datasets in video instance segmentation (VIS) and video panoptic segmentation (VPS). Notably, our method excels on the OVIS dataset, which is known for its particularly challenging videos.
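
To make the idea of cross-frame feature consistency concrete, the following is a minimal sketch of a generic InfoNCE-style contrastive loss over per-instance prototype vectors. It is not the PCC loss as formulated in this work, whose exact form the abstract does not specify. The function names, the cosine similarity, the `temperature` value, and the assumption that index `i` refers to the same instance in both frames are all illustrative choices.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def cross_frame_contrastive_loss(protos_a, protos_b, temperature=0.1):
    """Generic InfoNCE-style loss between two frames (hypothetical helper).

    Assumption: protos_a[i] and protos_b[i] are prototypes of the same
    instance, and every other pairing is treated as a negative.
    """
    losses = []
    for i, anchor in enumerate(protos_a):
        logits = [cosine(anchor, other) / temperature for other in protos_b]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the matching instance.
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


# Two instances whose prototypes drift only slightly between frames.
frame_t = [[1.0, 0.0], [0.0, 1.0]]
frame_next = [[0.9, 0.1], [0.1, 0.9]]
loss = cross_frame_contrastive_loss(frame_t, frame_next)
```

Under these assumptions the loss is small when each instance's prototype stays closer to its own counterpart in the other frame than to any other instance, which is the kind of consistency that makes cross-frame matching easier.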

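This paper presents Context-Aware Video Instance Segmentation (CAVIS), a new framework that strengthens instance association by integrating the contextual information adjacent to each object. It proposes the Context-Aware Instance Tracker (CAIT) to extract and use this information effectively, merging the surrounding context with the core instance features to improve tracking accuracy. It also introduces the Prototypical Cross-frame Contrastive (PCC) loss, which keeps object-level features consistent across frames and thereby significantly improves instance matching accuracy. CAVIS shows superior performance on all benchmark datasets for video instance segmentation (VIS) and video panoptic segmentation (VPS), and performs especially well on the particularly challenging OVIS dataset.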

Context-Aware Video Instance Segmentation