Text-based video segmentation aims to segment the target object in a video
according to a descriptive sentence. Incorporating motion information from optical
flow maps with appearance and linguistic modalities is crucial yet has been
largely ignored by previous work. In this paper, we design a method to fuse and
align appearance, motion, and linguistic features to achieve accurate
segmentation. Specifically, we propose a multi-modal video transformer, which
can fuse and aggregate multi-modal and temporal features between frames.
Furthermore, we design a language-guided feature fusion module to progressively
fuse appearance and motion features at each feature level with guidance from
linguistic features. Finally, a multi-modal alignment loss is proposed to
alleviate the semantic gap between features from different modalities.
Extensive experiments on A2D Sentences and J-HMDB Sentences demonstrate the
performance and generalization ability of our method in comparison with
state-of-the-art methods.
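
The two ideas named above, language-guided fusion and a multi-modal alignment loss, can be illustrated with a minimal sketch. The abstract does not specify either formulation, so everything here is an assumption made for illustration: the function names `language_guided_fusion` and `cosine_alignment_loss` are hypothetical, the gate is taken to be a per-channel sigmoid of the linguistic feature, and the alignment loss is taken to be one minus cosine similarity. Features are plain Python lists of equal length rather than real feature maps.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def language_guided_fusion(appearance, motion, language):
    """Blend appearance and motion channels with a language-derived gate.

    Assumption (not from the paper): the three vectors share one dimension,
    and the gate for each channel is the sigmoid of the language feature.
    """
    if not (len(appearance) == len(motion) == len(language)):
        raise ValueError("feature vectors must have equal length")
    fused = []
    for a, m, l in zip(appearance, motion, language):
        gate = sigmoid(l)  # in (0, 1): weight placed on the appearance channel
        fused.append(gate * a + (1.0 - gate) * m)
    return fused


def cosine_alignment_loss(visual, language):
    """Assumed alignment loss: 1 - cosine similarity between two features."""
    dot = sum(v * l for v, l in zip(visual, language))
    norm_v = math.sqrt(sum(v * v for v in visual))
    norm_l = math.sqrt(sum(l * l for l in language))
    if norm_v == 0.0 or norm_l == 0.0:
        raise ValueError("zero-norm feature vector")
    return 1.0 - dot / (norm_v * norm_l)


if __name__ == "__main__":
    fused = language_guided_fusion([1.0, 2.0], [3.0, 4.0], [0.0, 0.0])
    print(fused)
    print(cosine_alignment_loss(fused, [1.0, 1.0]))
```

With a zero linguistic feature the gate is 0.5 in every channel, so the fused vector is the plain average of appearance and motion. A real model would learn the gate and apply it at every feature level.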

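This paper proposes a multi-modal video segmentation method that fuses visual appearance, motion information, and linguistic features through a language-guided feature fusion module and a multi-modal alignment loss, achieving accurate text-based video segmentation. Experiments on the A2D Sentences and J-HMDB Sentences datasets show that the method offers better performance and generalization ability than existing methods.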

Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation

Text-based video segmentation aims to segment an actor in video sequences,
where a textual query specifies the actor and the action it performs. Previous
methods fail to explicitly align the video content with the textual query in a
fine-grained manner according to the actor and its action, due to the problem
of \emph{semantic asymmetry}. \emph{Semantic asymmetry} means that the two
modalities contain different amounts of semantic information during the
multi-modal fusion process. To alleviate this problem, we propose a novel actor
and action modular network that individually localizes the actor and its action
in two separate modules. Specifically, we first learn the actor-/action-related
content from the video and textual query, and then match them in a symmetrical
manner to localize the target tube. The target tube, which contains the desired
actor and action, is then fed into a fully convolutional network to predict
segmentation masks of the actor. Our method also establishes the association of
objects across multiple frames with the proposed temporal proposal aggregation
mechanism. This enables our method to segment the video effectively and maintain
the temporal consistency of predictions. The whole model allows joint learning
of actor-action matching and segmentation, and achieves state-of-the-art
performance for both single-frame segmentation and full video segmentation on
the A2D Sentences and J-HMDB Sentences datasets.
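
The idea of matching actor and action content separately and then selecting a target tube can be sketched as follows. The abstract does not describe the scoring function, so this is only an assumed illustration: `select_target_tube` is a hypothetical name, each candidate tube is represented by a pair of small feature vectors, and the two module scores are taken to be cosine similarities that are simply summed.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_target_tube(tubes, actor_query, action_query):
    """Return the index of the best-matching candidate tube.

    tubes: list of (actor_feature, action_feature) pairs, one per tube.
    Assumption (not from the paper): the actor and action modules each
    produce a cosine score, and the two scores are added with equal weight.
    """
    if not tubes:
        raise ValueError("no candidate tubes")
    scores = [
        cosine(actor, actor_query) + cosine(action, action_query)
        for actor, action in tubes
    ]
    return max(range(len(tubes)), key=scores.__getitem__)


if __name__ == "__main__":
    candidates = [
        ([1.0, 0.0], [0.0, 1.0]),  # right actor, wrong action
        ([1.0, 0.0], [1.0, 0.0]),  # right actor, right action
        ([0.0, 1.0], [1.0, 0.0]),  # wrong actor, right action
    ]
    print(select_target_tube(candidates, [1.0, 0.0], [1.0, 0.0]))
```

Scoring the actor and the action in separate terms means a tube that matches only one of them cannot outrank a tube that matches both, which is the intuition behind the symmetrical matching described above.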

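This paper proposes a text-based video segmentation method that introduces a novel actor and action modular network to address the semantic asymmetry problem, together with a temporal proposal aggregation mechanism, and achieves state-of-the-art performance for both single-frame segmentation and full video segmentation.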