Solving video-and-language grounding tasks requires the network to understand the connection between the two modalities. For a video paired with a language description, the semantic relation between them is reflected in the similarity of their encodings. A good multi-modal encoder should capture the semantics of both inputs and encode them in a shared feature space where embedding distance properly reflects semantic similarity. In this work, we focus on this semantic connection between video and language and develop a multi-level alignment training scheme that directly shapes the encoding process. We design video-language alignment pairs at the global and segment levels, based on information similarity ranging from high-level context to fine-grained semantics. A contrastive loss contrasts the encoding similarities of positive and negative alignment pairs, so that the network learns to encode similar information close together in the shared feature space while keeping information with different semantics apart. Our multi-level alignment training can be applied to various video-and-language grounding tasks. Combined with the task-specific training loss, our framework achieves performance comparable to previous state-of-the-art methods on multiple video QA and retrieval datasets.
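
The contrastive objective described above can be illustrated with a minimal sketch. It assumes an InfoNCE-style formulation with cosine similarity and a temperature parameter; the abstract does not specify the exact loss, so the function names (`cosine`, `log_sum_exp`, `contrastive_loss`), the temperature value, and the symmetric video-to-text and text-to-video averaging are all illustrative assumptions rather than the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def log_sum_exp(xs):
    """Numerically stable log(sum(exp(x))) over a list of floats."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def contrastive_loss(video_embs, text_embs, temperature=0.1):
    """InfoNCE-style loss (an assumed form, not taken from the paper).

    video_embs[i] and text_embs[i] form a positive alignment pair;
    every other combination in the batch is treated as a negative pair.
    """
    n = len(video_embs)
    # Scaled similarity matrix: rows index videos, columns index texts.
    sims = [[cosine(v, t) / temperature for t in text_embs] for v in video_embs]
    total = 0.0
    for i in range(n):
        # Video-to-text direction: pick the matching text among all texts.
        total += log_sum_exp(sims[i]) - sims[i][i]
        # Text-to-video direction: pick the matching video among all videos.
        column = [sims[j][i] for j in range(n)]
        total += log_sum_exp(column) - sims[i][i]
    return total / (2 * n)


# Toy batch: two videos and two descriptions in a 2-D shared space.
video = [[1.0, 0.0], [0.0, 1.0]]
text_aligned = [[1.0, 0.0], [0.0, 1.0]]   # positives lie close together
text_swapped = [[0.0, 1.0], [1.0, 0.0]]   # positives lie far apart

loss_aligned = contrastive_loss(video, text_aligned)
loss_swapped = contrastive_loss(video, text_swapped)
```

In this toy batch, `loss_aligned` is smaller than `loss_swapped`, which is the behavior the training scheme relies on: encodings of matching pairs are pulled together and mismatched ones pushed apart. Under the same assumptions, such a function could be evaluated once on global (whole video, whole description) embeddings and once on segment-level embeddings, with the two terms added to the task-specific loss.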

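This paper focuses on the semantic connection between video and language and proposes a multi-level alignment training scheme. Based on information similarity ranging from high-level context to fine-grained semantics, it aligns the video and language encodings through a contrastive loss, so that similar information is encoded close together in the shared feature space while information with different semantics is kept apart. Our multi-level alignment training can be applied to various video-and-language grounding tasks. Combined with the task-specific training loss, our framework achieves performance comparable to the prior state of the art on multiple video QA and retrieval datasets.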

A Multi-Level Alignment Training Scheme for Video-and-Language Grounding