We propose Unicoder-VL, a universal encoder that learns joint representations of vision and language through pre-training. Borrowing ideas from cross-lingual pre-trained models such as XLM and Unicoder, we feed both visual and linguistic content into a multi-layer Transformer for cross-modal pre-training. Three pre-training tasks are employed: masked language modeling, masked object label prediction, and visual-linguistic matching. The first two tasks learn context-aware representations for input tokens based jointly on linguistic and visual content. The last task predicts whether an image and a text describe each other. After pre-training on a large number of image-caption pairs, we transfer Unicoder-VL to image-text retrieval tasks with just one additional output layer, and achieve state-of-the-art performance on both MSCOCO and Flickr30K.
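
To make the three pre-training tasks concrete, the following is a minimal sketch of how one training example and its targets might be assembled. It is not the authors' implementation: the function name `build_pretraining_example`, the `[MASK]` placeholder, the 15% masking probability, and the use of plain label strings in place of region features are all assumptions made for illustration.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder for a masked token or region


def _mask_sequence(items, mask_prob, rng):
    """Mask each item independently; return (masked items, targets).

    A target is the original item at masked positions and None elsewhere.
    """
    masked, targets = [], []
    for item in items:
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets.append(item)
        else:
            masked.append(item)
            targets.append(None)
    return masked, targets


def build_pretraining_example(tokens, region_labels, matched,
                              mask_prob=0.15, rng=None):
    """Assemble toy inputs and targets for the three tasks.

    tokens:        caption tokens (masked language modeling)
    region_labels: one object label per image region, standing in for
                   region features (masked object label prediction)
    matched:       True if the caption describes the image
                   (visual-linguistic matching)
    mask_prob:     assumed masking rate, not taken from the text above
    """
    rng = rng or random.Random()
    masked_tokens, mlm_targets = _mask_sequence(tokens, mask_prob, rng)
    masked_regions, mol_targets = _mask_sequence(region_labels, mask_prob, rng)
    return {
        "tokens": masked_tokens,
        "regions": masked_regions,
        "mlm_targets": mlm_targets,   # predict original tokens
        "mol_targets": mol_targets,   # predict original object labels
        "match_label": 1 if matched else 0,
    }


if __name__ == "__main__":
    example = build_pretraining_example(
        ["a", "dog", "runs", "on", "grass"], ["dog", "grass"], matched=True
    )
    print(example)
```

In a real model, the masked positions would be fed through the Transformer together with the unmasked ones, and each task would contribute its own loss term; how those terms are weighted is not specified here.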

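By jointly learning representations of vision and language, Unicoder-VL provides a universal encoder trained with multiple cross-modal tasks, including masked language modeling, masked object classification, and visual-linguistic matching. After pre-training on large-scale image-caption data, Unicoder-VL can be applied to caption-based image-text retrieval and visual commonsense reasoning, where it achieves leading or comparable results and demonstrates the strength of cross-modal pre-training.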

Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training