State-of-the-art vision and vision-and-language models rely on large-scale
visio-linguistic pretraining for obtaining good performance on a variety of
downstream tasks. Generally, such models are often either cross-modal
(contrastive) or multi-modal (with earlier fusion) but not both; and they often
only target specific modalities or tasks. A promising direction would be to use
a single holistic universal model, as a "foundation", that targets all
modalities at once -- a true vision and language foundation model should be
good at vision tasks, language tasks, and cross- and multi-modal vision and
language tasks. We introduce FLAVA as such a model and demonstrate impressive
performance on a wide range of 35 tasks spanning these target modalities.

本篇研究提出了一种名为 FLAVA 的综合视觉与语言基础模型，通过使用单一的综合的通用模型，同时针对视觉和语言任务以及跨模态任务，展现出出色的性能表现。

FLAVA：一种基础语言和视觉对齐模型

FLAVA: A Foundational Language And Vision Alignment Model

In this paper, we propose to investigate the problem of out-of-domain
visio-linguistic pretraining, where the pretraining data distribution differs
from that of downstream data on which the pretrained model will be fine-tuned.
Existing methods for this problem are purely likelihood-based, leading to the
spurious correlations and hurt the generalization ability when transferred to
out-of-domain downstream tasks. By spurious correlation, we mean that the
conditional probability of one token (object or word) given another one can be
high (due to the dataset biases) without robust (causal) relationships between
them. To mitigate such dataset biases, we propose a Deconfounded
Visio-Linguistic Bert framework, abbreviated as DeVLBert, to perform
intervention-based learning. We borrow the idea of the backdoor adjustment from
the research field of causality and propose several neural-network based
architectures for Bert-style out-of-domain pretraining. The quantitative
results on three downstream tasks, Image Retrieval (IR), Zero-shot IR, and
Visual Question Answering, show the effectiveness of DeVLBert by boosting
generalization ability.

本文提出了 Deconfounded Visio-Linguistic Bert 框架，解决了视觉语言预训练中的跨域问题，并通过干预学习减轻数据集偏差，从而提高了模型的泛化能力。