Open-vocabulary multi-object tracking (OVMOT) is a new and challenging task: detecting and tracking objects of diverse categories in videos, including both seen categories (base classes) and unseen categories (novel classes). The task combines the difficulties of open-vocabulary object detection (OVD) and multi-object tracking (MOT). Existing approaches to OVMOT often combine OVD and MOT methods as separate modules and treat the problem mainly from an image-centric perspective. In this paper, we propose VOVTrack, a novel method that integrates MOT-related object states with video-centric training to address the task from a video object tracking standpoint. First, we consider the tracking-related states of objects during tracking and propose a new prompt-guided attention mechanism for more accurate localization and classification (detection) of time-varying objects. Second, we leverage raw, unannotated video data for training by formulating a self-supervised object similarity learning technique that facilitates temporal object association (tracking). Experimental results show that VOVTrack outperforms existing methods, establishing it as a state-of-the-art solution for the open-vocabulary tracking task.
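
To make the notion of similarity-based temporal association more concrete, the following is a minimal illustrative sketch, not VOVTrack's actual procedure. It assumes each existing track and each new detection already carries an appearance embedding (for example, one produced by a learned similarity model) and links them greedily by cosine similarity. The function names, the toy embeddings, and the `threshold` value are hypothetical and chosen only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def associate(track_embeddings, detection_embeddings, threshold=0.5):
    """Greedy one-to-one matching of tracks to detections.

    Returns a dict mapping track index -> detection index. Pairs whose
    similarity falls below `threshold` (a hypothetical value) stay unmatched.
    """
    # Score every (track, detection) pair.
    pairs = []
    for t, t_emb in enumerate(track_embeddings):
        for d, d_emb in enumerate(detection_embeddings):
            pairs.append((cosine_similarity(t_emb, d_emb), t, d))

    # Visit pairs from most to least similar, keeping each track and
    # each detection at most once.
    pairs.sort(key=lambda p: p[0], reverse=True)
    matches = {}
    used_detections = set()
    for sim, t, d in pairs:
        if sim < threshold:
            break
        if t in matches or d in used_detections:
            continue
        matches[t] = d
        used_detections.add(d)
    return matches


# Toy embeddings (illustrative only): two tracks, three detections.
tracks = [[1.0, 0.0], [0.0, 1.0]]
detections = [[0.1, 0.9], [0.9, 0.1], [-1.0, -1.0]]
matches = associate(tracks, detections)
```

In this toy case, track 0 is linked to detection 1 and track 1 to detection 0, while detection 2 stays unmatched and could start a new track. Greedy matching is used here only for brevity; trackers often use an optimal assignment such as the Hungarian algorithm instead.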

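This work addresses the challenge of detecting and tracking objects of diverse categories in open-vocabulary multi-object tracking (OVMOT). The proposed VOVTrack method integrates MOT-related object states with video-centric training to improve object localization and classification. Experimental results show that VOVTrack outperforms existing methods on the open-vocabulary tracking task, making it a state-of-the-art solution in this area.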

VOVTrack: Exploring the Potential of Open-Vocabulary Object Tracking in Videos