Dongxu Li, Junnan Li, Hongdong Li, Juan Carlos Niebles, Steven C. H. Hoi
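TL;DR: This paper proposes Align and Prompt, an efficient and effective video-and-language pre-training framework. It introduces a video-text contrastive (VTC) loss and a prompting entity modeling (PEM) task to improve cross-modal alignment and learn fine-grained region-entity alignment, yielding significant performance gains over previous methods.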
Abstract
Video-and-language pre-training has shown promising improvements on various downstream tasks. Most previous methods capture cross-modal interactions with a transformer-based multimodal encoder, not fully addressing the misalignment between unimodal video and text features. Besides, lea