Vision-language pre-training (VLP) on large-scale image-text pairs has achieved huge success for the cross-modal downstream tasks. The most existing pre-training methods mainly adopt a two-step training procedure, which firstly employs a pre-trained object detector to extract region-based visual features, then concatenates the image representation and text embedding as the input of Transformer to train. However, these methods face problems of using task-specific visual representation of the specific object detector for generic cross-modal understanding, and the computation inefficiency of two-stage pipeline. In this paper, we propose the first end-to-end vision-language pre-trained model for both V+L understanding and generation, namely E2E-VLP, where we build a unified Transformer framework to jointly learn visual representation, and semantic alignments between image and text. We incorporate the tasks of object detection and image captioning into pre-training with a unified Transformer encoder-decoder architecture for enhancing visual learning. An extensive set of experiments have been conducted on well-established vision-language downstream tasks to demonstrate the effectiveness of this novel VLP paradigm.

本文提出了一种用于视觉和语言理解与生成的端到端的视觉-语言预训练模型 E2E-VLP，其中我们建立了一个统一的 Transformer 框架来共同学习视觉表示和图像文本语义对齐，同时通过将目标检测和图像字幕生成任务整合到预训练中，采用统一的编码-解码结构增强了视觉学习。在广泛的视觉-语言相关下游任务中进行的一系列实验表明了该新 VLP 模型的有效性。

E2E-VLP: 结合视觉学习的端到端视觉-语言预训练