The Pixel-level Video Understanding in the Wild Challenge (PVUW) focuses on complex
video understanding. In this CVPR 2024 workshop, we add two new tracks: the Complex
Video Object Segmentation track based on the MOSE dataset and the Motion Expression
guided Video Segmentation track based on the MeViS dataset. In the two new tracks,
we provide additional videos and annotations that feature challenging elements,
such as the disappearance and reappearance of objects, inconspicuous small
objects, heavy occlusions, and crowded environments in MOSE. Moreover, we
provide a new motion expression guided video segmentation dataset, MeViS, to
study natural language-guided video understanding in complex environments.
These new videos, sentences, and annotations enable us to foster the
development of a more comprehensive and robust pixel-level understanding of
video scenes in complex environments and realistic scenarios. The MOSE
challenge had 140 registered teams in total; 65 teams participated in the
validation phase and 12 teams made valid submissions in the final challenge
phase. The MeViS challenge had 225 registered teams in total; 50 teams
participated in the validation phase and 5 teams made valid submissions in the
final challenge phase.

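The challenge on pixel-level video understanding in complex environments adds two new tracks, complex video object segmentation based on the MOSE dataset and motion expression guided video segmentation based on the MeViS dataset, and provides additional videos and annotations with challenging elements to foster a comprehensive and robust understanding of pixel-level video scenes.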

PVUW 2024 Challenge on Complex Video Understanding: Methods and Results

The third Pixel-level Video Understanding in the Wild (PVUW CVPR 2024)
challenge aims to advance the state of the art in video understanding through
benchmarking Video Panoptic Segmentation (VPS) and Video Semantic Segmentation
(VSS) on challenging videos and scenes introduced in the large-scale Video
Panoptic Segmentation in the Wild (VIPSeg) test set and the large-scale Video
Scene Parsing in the Wild (VSPW) test set, respectively. This paper details our
research work that won 1st place in the PVUW'24 VPS challenge, establishing
state-of-the-art results in all metrics, including Video Panoptic Quality
(VPQ) and Segmentation and Tracking Quality (STQ). With minor fine-tuning,
our approach also achieved 3rd place in the PVUW'24 VSS challenge ranked by
the mIoU (mean intersection over union) metric and 1st place ranked by the
VC16 (16-frame video consistency) metric. Our winning solution stands on the
shoulders of a giant foundational vision transformer model (DINOv2 ViT-g) and
the proven multi-stage Decoupled Video Instance Segmentation (DVIS) frameworks
for video understanding.
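
The mIoU metric used for the VSS ranking can be illustrated with a minimal sketch. The function name `mean_iou`, the flat label-sequence input, and the choice to skip classes that appear in neither prediction nor ground truth are assumptions made for illustration; the official PVUW evaluation may aggregate counts differently (for example, over the whole test set rather than per call).

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection over union of two equal-length label sequences.

    Hypothetical helper for illustration only, not the challenge's
    evaluation code. Classes absent from both inputs are skipped.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    intersection = [0] * num_classes
    pred_count = [0] * num_classes
    target_count = [0] * num_classes

    # Accumulate per-class pixel counts in a single pass.
    for p, t in zip(pred, target):
        pred_count[p] += 1
        target_count[t] += 1
        if p == t:
            intersection[p] += 1

    ious = []
    for c in range(num_classes):
        union = pred_count[c] + target_count[c] - intersection[c]
        if union > 0:  # skip classes that never occur
            ious.append(intersection[c] / union)

    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    # Four pixels, two classes: class 0 has IoU 1/2, class 1 has IoU 2/3.
    print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```

In the example call, the mean of the two per-class IoU values is 7/12.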

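This paper details the research work that won 1st place in the PVUW'24 VPS challenge and 3rd place in the PVUW'24 VSS challenge; the solution builds on the DINOv2 ViT-g vision transformer model and the multi-stage Decoupled Video Instance Segmentation (DVIS) framework.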

1st Place Winner of the 2024 Pixel-level Video Understanding in the Wild (CVPR'24 PVUW) Challenge in Video Panoptic Segmentation and Best Long Video Consistency of Video Semantic Segmentation

Pixel-level Video Understanding requires effectively integrating
three-dimensional data in both spatial and temporal dimensions to learn
accurate and stable semantic information from continuous frames. However,
existing advanced models on the VSPW dataset have not fully modeled
spatiotemporal relationships. In this paper, we present our solution for the
PVUW competition, where we introduce masked video consistency (MVC) based on
existing models. MVC enforces the consistency between predictions of masked
frames where random patches are withheld. The model needs to learn the
segmentation results of the masked parts through the context of images and the
relationship between preceding and succeeding frames of the video.
Additionally, we employed test-time augmentation, model aggregation, and a
multimodal model-based post-processing method. Our approach achieves 67.27%
mIoU on the VSPW dataset, ranking 2nd in the VSS track of the PVUW2024
challenge.
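
The masking idea behind MVC can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the names `mask_random_patches` and `consistency_loss`, the patch size, the mask ratio, and the use of a hard-label disagreement rate as the consistency term are all hypothetical. The abstract does not specify the loss, so a real training setup would likely use a differentiable loss on soft predictions.

```python
import random


def mask_random_patches(frame, patch_size, mask_ratio, rng, fill=0):
    """Return a copy of `frame` (a list of rows) with random patches withheld.

    Each patch_size x patch_size block is replaced by `fill` with
    probability `mask_ratio`. Hypothetical helper for illustration.
    """
    height, width = len(frame), len(frame[0])
    masked = [row[:] for row in frame]  # leave the input frame untouched
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            if rng.random() < mask_ratio:
                for y in range(top, min(top + patch_size, height)):
                    for x in range(left, min(left + patch_size, width)):
                        masked[y][x] = fill
    return masked


def consistency_loss(masked_pred, reference_pred):
    """Fraction of pixels whose labels differ between two label maps.

    Assumption: hard labels and a disagreement rate stand in for whatever
    consistency loss the actual method uses.
    """
    total = 0
    disagree = 0
    for row_m, row_r in zip(masked_pred, reference_pred):
        for m, r in zip(row_m, row_r):
            total += 1
            disagree += int(m != r)
    return disagree / total if total else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    frame = [[1] * 8 for _ in range(8)]
    masked = mask_random_patches(frame, patch_size=4, mask_ratio=0.5, rng=rng)
    for row in masked:
        print(row)
```

In training, a segmentation model would predict labels for the masked frame, and the consistency term would penalize disagreement with the prediction obtained from the unmasked frame, which pushes the model to infer withheld regions from spatial context and neighboring frames.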

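We present a solution that adds masked video consistency (MVC) to existing models: enforcing consistency between predictions on masked frames makes the model learn the segmentation of the masked regions and the relationship between preceding and succeeding frames. Combined with test-time augmentation, model aggregation, and multimodal model-based post-processing, the method achieves 67.27% mIoU on the VSPW dataset and ranks 2nd in the VSS track of the PVUW2024 challenge.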