We propose a new two-stage pre-training framework for video-to-text
generation tasks such as video captioning and video question answering: a
generative encoder-decoder model is first jointly pre-trained on massive
image-text data to learn fundamental vision-language concepts, and then adapted
to video data in an intermediate video-text pre-training stage to learn
video-specific skills such as spatio-temporal reasoning. As a result, our
VideoOFA model achieves new state-of-the-art performance on four video
captioning benchmarks, beating prior art by an average of 9.7 points in CIDEr
score. It also outperforms existing models on two open-ended video question
answering datasets, showcasing its generalization capability as a universal
video-to-text model.

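Summary: This work proposes a new two-stage pre-training framework, VideoOFA, for video captioning and video question answering. The model is first pre-trained on large-scale image-text data, and then adapted to video data in an intermediate video-text pre-training stage to learn video-specific skills such as spatio-temporal reasoning. As a result, it achieves new state-of-the-art performance on four video captioning benchmarks and outperforms existing models on two open-ended video question answering datasets, demonstrating its generalization capability as a universal video-to-text model.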