Video segmentation aims to partition video sequences into meaningful segments based on objects or regions of interest within frames. Current video segmentation models are often derived from image segmentation techniques, which struggle with small-scale or class-imbalanced video datasets and therefore produce inconsistent segmentation results across frames. To address these issues, we propose Masked Video Consistency (MVC), a training strategy that enhances spatial and temporal feature aggregation. MVC randomly masks image patches and compels the network to predict the entire semantic segmentation, thus improving the integration of contextual information. Additionally, we introduce Object Masked Attention (OMA), which optimizes the cross-attention mechanism by reducing the impact of irrelevant queries, thereby enhancing temporal modeling. Our approach, integrated into the latest decoupled universal video segmentation framework, achieves state-of-the-art performance across five datasets for three video segmentation tasks, demonstrating significant improvements over previous methods without increasing model parameters.
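The random patch masking described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `random_patch_mask`, the patch size, the mask ratio, and the choice of zero as the fill value are all assumptions made for illustration.

```python
import numpy as np


def random_patch_mask(frames, patch_size=16, mask_ratio=0.5, rng=None):
    """Zero out random square patches in every frame of a clip.

    Hypothetical helper for illustration only.
    frames: array of shape (T, H, W, C); H and W must be multiples of patch_size.
    Returns the masked frames and a boolean (T, H, W) array that is True
    wherever a pixel was kept.
    """
    rng = np.random.default_rng() if rng is None else rng
    t, h, w, _ = frames.shape
    if h % patch_size or w % patch_size:
        raise ValueError("H and W must be multiples of patch_size")

    # Draw one keep/drop decision per patch, independently for each frame.
    grid_h, grid_w = h // patch_size, w // patch_size
    keep_grid = rng.random((t, grid_h, grid_w)) >= mask_ratio

    # Expand the patch-level decisions to pixel resolution.
    keep = np.repeat(np.repeat(keep_grid, patch_size, axis=1), patch_size, axis=2)

    # Assumed fill value: masked pixels are set to zero.
    masked = frames * keep[..., None].astype(frames.dtype)
    return masked, keep
```

Under this sketch, a training step would feed `masked` to the network while still supervising it with the segmentation target of the full, unmasked frames, so that masked regions have to be inferred from the surrounding context.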

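This study addresses the inconsistency that existing video segmentation models exhibit on small-scale or class-imbalanced datasets by proposing a new training strategy, Masked Video Consistency (MVC). The method randomly masks image patches to strengthen spatio-temporal feature aggregation, and introduces Object Masked Attention (OMA) to optimize the cross-attention mechanism, significantly improving model performance across multiple datasets.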

Rethinking Video Segmentation with Masked Video Consistency: Did the Model Learn as Intended?