The ability to recognize, localize, and track dynamic objects in a scene is fundamental to many real-world applications, such as self-driving and robotic systems. Yet traditional multiple object tracking (MOT) benchmarks rely on only a few object categories, which hardly represent the multitude of objects encountered in the real world. This leaves contemporary MOT methods limited to a small set of pre-defined object categories. In this paper, we address this limitation by tackling a novel task, open-vocabulary MOT, which aims to evaluate tracking beyond pre-defined training categories. We further develop OVTrack, an open-vocabulary tracker that is capable of tracking arbitrary object classes. Its design is based on two key ingredients: first, leveraging vision-language models for both classification and association via knowledge distillation; second, a data hallucination strategy that uses denoising diffusion probabilistic models to learn robust appearance features. The result is an extremely data-efficient open-vocabulary tracker that sets a new state of the art on the large-scale, large-vocabulary TAO benchmark while being trained solely on static images. Project page: https://www.vis.xyz/pub/ovtrack/
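
To make the two roles of the embeddings concrete, the following is a minimal illustrative sketch, not OVTrack's actual implementation. It assumes that each detected region already has an embedding in the same space as the text embeddings of class names (the role the abstract assigns to vision-language model distillation), and that each region also has an appearance embedding for association. The function names `classify` and `associate`, the toy vectors, the similarity threshold of 0.5, and the greedy matching rule are all hypothetical choices made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify(region_embedding, text_embeddings):
    """Return the class name whose text embedding is most similar to the region.

    Because classes are given as text embeddings, the vocabulary can be
    changed at test time without retraining (assumption: shared space).
    """
    scores = {name: cosine(region_embedding, emb)
              for name, emb in text_embeddings.items()}
    return max(scores, key=scores.get)


def associate(track_embeddings, detection_embeddings, threshold=0.5):
    """Greedily match detections to existing tracks by appearance similarity.

    Returns a dict mapping detection index -> track id. Detections whose
    best remaining similarity is below the threshold stay unmatched.
    """
    pairs = sorted(
        ((cosine(t, d), track_id, det_idx)
         for track_id, t in track_embeddings.items()
         for det_idx, d in enumerate(detection_embeddings)),
        reverse=True,
    )
    matches, used_tracks, used_dets = {}, set(), set()
    for sim, track_id, det_idx in pairs:
        if sim < threshold:
            break  # remaining pairs are even less similar
        if track_id in used_tracks or det_idx in used_dets:
            continue
        matches[det_idx] = track_id
        used_tracks.add(track_id)
        used_dets.add(det_idx)
    return matches


# Toy vocabulary: hypothetical text embeddings for two class names.
text_embeddings = {"cat": [1.0, 0.0, 0.0], "skateboard": [0.0, 1.0, 0.0]}
label = classify([0.1, 0.9, 0.0], text_embeddings)

# Toy association: two existing tracks, three new detections.
tracks = {7: [1.0, 0.0], 9: [0.0, 1.0]}
detections = [[0.1, 0.9], [0.9, 0.1], [-1.0, -1.0]]
matches = associate(tracks, detections)
```

In this toy setup the region is labeled by its nearest class-name embedding, and the third detection, which resembles neither track, is left unmatched so that a tracker could start a new track for it. A real system would replace the toy vectors with learned embeddings and would typically use a more careful matching scheme.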

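This work addresses an inherent limitation of traditional multiple object tracking methods, which target only a few pre-defined object categories, and proposes a new task, open-vocabulary MOT. It further develops OVTrack, a highly data-efficient open-vocabulary tracker that uses knowledge distillation and a data hallucination strategy to improve classification and association accuracy, and it sets a new state of the art on the large-scale TAO benchmark.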

OVTrack: Open-Vocabulary Multiple Object Tracking