We present ViLBERT (short for Vision-and-Language BERT), a model for learning
task-agnostic joint representations of image content and natural language. We
extend the popular BERT architecture to a multi-modal two-stream model,
pro-cessing both visual and textual inputs in separate streams that interact
through co-attentional transformer layers. We pretrain our model through two
proxy tasks on the large, automatically collected Conceptual Captions dataset
and then transfer it to multiple established vision-and-language tasks --
visual question answering, visual commonsense reasoning, referring expressions,
and caption-based image retrieval -- by making only minor additions to the base
architecture. We observe significant improvements across tasks compared to
existing task-specific models -- achieving state-of-the-art on all four tasks.
Our work represents a shift away from learning groundings between vision and
language only as part of task training and towards treating visual grounding as
a pretrainable and transferable capability.

ViLBERT 是一种用于学习图像内容和自然语言的任务不可知联合表示的模型，并通过在多模态两个流中处理图像和文本输入，通过相互关注变压器层实现交互。我们通过在大型自动收集的概念字幕数据集上执行两个代理任务来预训练我们的模型，然后通过仅对基础体系结构进行轻微添加，将其转移到多个已建立的视觉语言任务 —— 视觉问答、视觉常识推理、指称表达和基于字幕的图像检索，我们观察到与现有特定任务模型相比，在所有四个任务中都实现了显着的改进，成为学习视觉与语言之间接地只作为任务培训的一部分，而不是对待视觉接地作为可预训练和可转移能力的代表性工作。