Paul Hongsuck Seo, Arsha Nagrani, Anurag Arnab, Cordelia Schmid
TL;DR: Proposes Multimodal Video Generative Pretraining (MV-GPT), a new pretraining framework that leverages future utterances in unlabelled videos as an additional text source and introduces a bidirectional generation objective, yielding an end-to-end model that generates multimodal video captions directly from raw frames and transcribed speech.
Abstract
Recent video and language pretraining frameworks lack the ability to generate sentences. We present Multimodal Video Generative Pretraining (MV-GPT), a new pretraining framework for learning from unlabelled videos which can be effectively used for generative tasks such as multimodal video captioning.
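To make the bidirectional generation objective concrete, here is a minimal sketch of how the two generation directions could be combined during pretraining: generate the future utterance from the present multimodal context, and the present utterance from the future one. The `encoder`/`decoder` interfaces and the unweighted sum of the two losses are illustrative assumptions, not the authors' released implementation.

```python
import torch.nn as nn


class BidirectionalGenerationLoss(nn.Module):
    """Sketch of a bidirectional generation objective (assumed interfaces).

    `encoder` is a multimodal video encoder taking frames plus an utterance;
    `decoder` is an autoregressive sentence decoder returning a token-level
    cross-entropy loss against a target utterance. Both are hypothetical
    stand-ins, not MV-GPT's actual modules.
    """

    def __init__(self, encoder: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, frames, present_utt, future_utt):
        # Forward direction: present utterance + frames -> future utterance.
        loss_fwd = self.decoder(self.encoder(frames, present_utt), target=future_utt)
        # Backward direction: future utterance + frames -> present utterance.
        loss_bwd = self.decoder(self.encoder(frames, future_utt), target=present_utt)
        # The two directions are optimised jointly (equal weighting assumed here).
        return loss_fwd + loss_bwd
```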