Most existing multi-object tracking methods learn visual tracking features by maximizing the dissimilarity between different instances and maximizing the similarity between observations of the same instance. While such a feature learning scheme achieves promising performance, learning discriminative features solely from visual information is challenging, especially under environmental interference such as occlusion, blur and domain variance. In this work, we argue that multi-modal language-driven features provide information complementary to classical visual features, thereby improving robustness to such interference. To this end, we propose a new multi-object tracking framework, named LG-MOT, that explicitly leverages language information at different levels of granularity (scene- and instance-level) and combines it with standard visual features to obtain discriminative representations. To develop LG-MOT, we annotate existing MOT datasets with scene- and instance-level language descriptions. We then encode both instance- and scene-level language information into high-dimensional embeddings, which are used to guide the visual features during training. At inference, our LG-MOT uses the standard visual features without relying on annotated language descriptions. Extensive experiments on three benchmarks, MOT17, DanceTrack and SportsMOT, demonstrate the merits of the proposed contributions, which lead to state-of-the-art performance. On the DanceTrack test set, our LG-MOT achieves an absolute gain of 2.2\% in target object association (IDF1 score) over the baseline that uses only visual features. Further, our LG-MOT exhibits strong cross-domain generalizability. The dataset and code will be available at \url{https://github.com/WesLee88524/LG-MOT}.
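
The split between training-time language guidance and vision-only inference can be illustrated with a minimal sketch. It is not the LG-MOT implementation: the abstract does not specify the loss or the association step, so the cosine alignment loss, the greedy matcher, and every function name below are hypothetical stand-ins. The sketch also assumes visual features and language embeddings have already been projected to the same dimension.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def language_guidance_loss(visual_feats, text_embeds):
    """Training only (hypothetical loss): mean (1 - cosine) between each
    visual feature and the embedding of its paired language description."""
    assert len(visual_feats) == len(text_embeds)
    total = sum(1.0 - cosine(v, t) for v, t in zip(visual_feats, text_embeds))
    return total / len(visual_feats)


def associate(track_feats, det_feats):
    """Inference (hypothetical matcher): greedy one-to-one matching of tracks
    to detections by visual similarity alone; no language input is used."""
    pairs = sorted(
        ((cosine(t, d), i, j)
         for i, t in enumerate(track_feats)
         for j, d in enumerate(det_feats)),
        reverse=True,
    )
    used_tracks, used_dets, matches = set(), set(), {}
    for _, i, j in pairs:
        if i not in used_tracks and j not in used_dets:
            matches[i] = j
            used_tracks.add(i)
            used_dets.add(j)
    return matches
```

During training, a term such as `language_guidance_loss` would be added to the usual visual objective so that the visual features are pulled toward their language embeddings; at test time only `associate` runs, which mirrors the statement that inference does not rely on annotated descriptions.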

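By combining multi-modal language-driven features with visual features, this work proposes a new multi-object tracking framework, LG-MOT, which explicitly leverages language information at different levels of granularity (scene- and instance-level) and combines it with standard visual features to obtain discriminative representations. Existing MOT datasets are annotated with scene- and instance-level language descriptions; this language information is encoded into high-dimensional embeddings and used to guide the visual features during training. Extensive experiments on three benchmarks, MOT17, DanceTrack and SportsMOT, show that the proposed method achieves state-of-the-art performance, with an absolute gain of 2.2% on the DanceTrack test set over the baseline that uses only visual features. In addition, LG-MOT exhibits strong cross-domain generalizability.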

Multi-Granularity Language-Guided Multi-Object Tracking