The limits of applicability of vision-and-language models are defined by the
coverage of their training data. Tasks like visual question answering (VQA)
often require commonsense and factual information beyond what can be learned
from task-specific datasets. This paper investigates the injection of knowledge
from general-purpose knowledge bases (KBs) into vision-and-language
transformers. We use an auxiliary training objective that encourages the
learned representations to align with graph embeddings of matching entities in
a KB. We empirically study the relevance of various KBs to multiple tasks and
benchmarks. The technique brings clear benefits to knowledge-demanding question
answering tasks (OK-VQA, FVQA) by capturing semantic and relational knowledge
absent from existing models. More surprisingly, the technique also benefits
visual reasoning tasks (NLVR2, SNLI-VE). We perform probing experiments and
show that the injection of additional knowledge regularizes the space of
embeddings, which improves the representation of lexical and semantic
similarities. The technique is model-agnostic and can expand the applicability
of any vision-and-language transformer with minimal computational overhead.
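
The auxiliary objective described above can be illustrated with a minimal sketch. The abstract does not specify the loss, so everything here is an assumption made for illustration: a cosine-based alignment term, token representations already projected to the dimension of the KB graph embeddings, a precomputed mapping from token positions to KB entities, and the hypothetical names `alignment_loss`, `total_loss`, and `weight`.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def alignment_loss(token_reprs, kb_embeddings, matches):
    """Mean (1 - cosine) between matched token representations and KB embeddings.

    token_reprs:   list of vectors, one per token (assumed already projected
                   to the KB embedding dimension).
    kb_embeddings: dict mapping entity id -> graph embedding vector.
    matches:       dict mapping token index -> entity id (assumed to be given
                   by some entity-matching step).
    """
    if not matches:
        return 0.0
    total = 0.0
    for token_idx, entity_id in matches.items():
        sim = cosine_similarity(token_reprs[token_idx], kb_embeddings[entity_id])
        total += 1.0 - sim
    return total / len(matches)


def total_loss(task_loss, token_reprs, kb_embeddings, matches, weight=0.1):
    """Task loss plus the weighted auxiliary term; the weight is an assumed hyperparameter."""
    return task_loss + weight * alignment_loss(token_reprs, kb_embeddings, matches)
```

Because the extra term only touches the loss, a sketch like this leaves the underlying transformer unchanged, which is consistent with the model-agnostic claim in the abstract.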

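This paper studies injecting knowledge from general-purpose knowledge bases into vision-and-language models. An auxiliary training objective strengthens the representation of semantic and relational knowledge, which improves performance on question answering and visual reasoning tasks. The technique does not depend on a specific model and adds little computational overhead.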

Reasoning over Vision and Language: Exploring the Benefits of Supplemental Knowledge

Pre-trained contextual vision-and-language (V&L) models have achieved
impressive performance on various benchmarks. However, existing models require
a large amount of parallel image-caption data for pre-training. Such data are
costly to collect and require cumbersome curation. Inspired by unsupervised
machine translation, we investigate whether a strong V&L representation model can
be learned through unsupervised pre-training without image-caption corpora. In
particular, we propose to conduct "mask-and-predict" pre-training on
text-only and image-only corpora and introduce the object tags detected by an
object recognition model as anchor points to bridge two modalities. We find
that such a simple approach achieves performance close to a model pre-trained
with aligned data, on four English V&L benchmarks. Our work challenges the
widely held notion that aligned data is necessary for V&L pre-training, while
significantly reducing the amount of supervision needed for V&L models.
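
The "mask-and-predict" idea above can be sketched roughly as follows. This is a toy illustration, not the paper's implementation: the 15% masking rate, the choice to mask only object tags on the image side, the string region identifiers, and the helper names `mask_tokens`, `text_only_example`, and `image_only_example` are all assumptions.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Replace each token with MASK with probability mask_prob.

    Returns (inputs, labels): labels hold the original token at masked
    positions and None elsewhere, so the model is only asked to predict
    what was hidden.
    """
    rng = rng or random.Random()
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels


def text_only_example(caption_tokens, rng=None):
    """Text-only corpus: mask-and-predict over the words of a sentence."""
    return mask_tokens(caption_tokens, rng=rng)


def image_only_example(object_tags, region_ids, rng=None):
    """Image-only corpus: detected object tags accompany the image regions.

    The tags are ordinary words, so they share a vocabulary with the
    text-only stream and can act as anchors between the two modalities.
    Here only the tags are masked (an assumption); regions pass through.
    """
    masked_tags, labels = mask_tokens(object_tags, rng=rng)
    return masked_tags, region_ids, labels
```

Note that neither function ever sees an image paired with a caption, which is the sense in which pre-training of this kind needs no aligned data.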

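This paper learns vision-and-language models through unsupervised pre-training: it applies "mask-and-predict" pre-training to text-only and image-only data and uses the object tags detected by an object recognition model as a bridge between the two modalities. On four English V&L benchmarks, the approach performs close to a model pre-trained with aligned data, which challenges the widely held view that aligned data is necessary for V&L pre-training and significantly reduces the supervision that V&L models need.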

Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions

Mirroring the success of masked language models, vision-and-language
counterparts like ViLBERT, LXMERT and UNITER have achieved state-of-the-art
performance on a variety of multimodal discriminative tasks like visual
question answering and visual grounding. Recent work has also successfully
adapted such models towards the generative task of image captioning. This raises
the question: Can these models go the other way and generate images from pieces
of text? Our analysis of a popular representative from this model family,
LXMERT, finds that it is unable to generate rich and semantically meaningful
imagery with its current training setup. We introduce X-LXMERT, an extension to
LXMERT with training refinements including discretizing visual
representations, using uniform masking with a large range of masking ratios, and
aligning the right pre-training datasets to the right objectives, which together
enable it to paint. X-LXMERT's image generation capabilities rival state-of-the-art
generative models while its question answering and captioning abilities remain
comparable to LXMERT. Finally, we demonstrate the generality of these training
refinements by adding image generation capabilities into UNITER to produce
X-UNITER.
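
One of the refinements named above, uniform masking with a large range of masking ratios, could look roughly like the sketch below. The abstract gives no concrete range, so the bounds 0.1 and 1.0, the flat list of grid positions, and the function name `sample_uniform_mask` are assumptions made for illustration.

```python
import random


def sample_uniform_mask(num_positions, min_ratio=0.1, max_ratio=1.0, rng=None):
    """Draw a masking ratio uniformly from [min_ratio, max_ratio], then mask
    that fraction of positions chosen uniformly at random.

    Returns a list of booleans, True where the position is masked.
    """
    rng = rng or random.Random()
    ratio = rng.uniform(min_ratio, max_ratio)
    # Mask at least one position when possible, never more than exist.
    num_masked = min(num_positions, max(1, round(ratio * num_positions)))
    masked = set(rng.sample(range(num_positions), num_masked))
    return [i in masked for i in range(num_positions)]
```

Sampling the ratio afresh for every example exposes the model to inputs ranging from lightly masked to almost fully masked, which plausibly matters for generation, where sampling starts from a fully masked grid.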

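This paper examines whether the vision-and-language (V&L) model LXMERT can generate images and finds that, with its current training setup, it cannot produce rich, semantically meaningful imagery. It proposes X-LXMERT, whose training refinements make its image generation rival state-of-the-art generative models while keeping its question answering and captioning performance comparable to LXMERT, and it shows that these refinements carry over to other V&L models.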