Text-based video segmentation aims to segment the target object in a video
given a sentence that describes it. Incorporating motion information from optical
flow maps with appearance and linguistic modalities is crucial yet has been
largely ignored by previous work. In this paper, we design a method to fuse and
align appearance, motion, and linguistic features to achieve accurate
segmentation. Specifically, we propose a multi-modal video transformer, which
can fuse and aggregate multi-modal and temporal features between frames.
Furthermore, we design a language-guided feature fusion module to progressively
fuse appearance and motion features at each feature level with guidance from
linguistic features. Finally, a multi-modal alignment loss is proposed to
alleviate the semantic gap between features from different modalities.
Extensive experiments on A2D Sentences and J-HMDB Sentences verify the
performance and generalization ability of our method in comparison with
state-of-the-art methods.

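This paper proposes a multi-modal video segmentation method that fuses visual
appearance, motion, and linguistic features through a language-guided feature
fusion module and a multi-modal alignment loss, achieving accurate text-based
video segmentation. Experiments on the A2D Sentences and J-HMDB Sentences
datasets show that the method achieves better performance and generalization
ability than existing methods.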