The ability to model intra-modal and inter-modal interactions is fundamental
in multimodal machine learning. Current state-of-the-art models usually
adopt deep architectures with fixed structures. They can achieve exceptional
performance on specific tasks, but they face a particularly challenging problem,
modality mismatch, which arises from the diversity of input modalities combined
with their fixed structures. In this paper, we present \textbf{Switch-BERT} for joint vision and
language representation learning to address this problem. Switch-BERT extends
the BERT architecture by introducing learnable layer-wise and cross-layer
interactions. It learns to optimize attention over a set of attention modes
representing these interactions. One specific property of the model is that it
learns to attend to outputs from various depths, thereby mitigating the modality
mismatch problem. We present extensive experiments on visual question
answering, image-text retrieval, and referring expression comprehension.
Results confirm that, whereas alternative architectures including
ViLBERT and UNITER may excel in particular tasks, Switch-BERT consistently
achieves performance better than or comparable to that of current
state-of-the-art models on these tasks. Ablation studies indicate that the
proposed model achieves superior performance due to its ability to learn
task-specific multimodal interactions.

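This paper proposes Switch-BERT, a multimodal machine learning model that introduces learnable layer-wise and cross-layer interactions to optimize attention over a set of attention modes. This addresses the modality mismatch caused by diverse input modalities and fixed structures, and yields strong performance on tasks such as visual question answering, image-text retrieval, and referring expression comprehension.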
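
The idea of choosing among attention modes can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the modes are defined or selected. The sketch assumes three hypothetical modes (intra-modal, inter-modal, joint) expressed as masks over key positions, and a soft mixture weighted by a softmax over hypothetical learnable logits called `mode_logits`. Cross-layer inputs are omitted.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(query, keys, values, allowed):
    """Scaled dot-product attention of one query over the allowed key positions.
    Assumes at least one position is allowed."""
    d = len(query)
    idx = [i for i in range(len(keys)) if allowed[i]]
    scores = [sum(q * k for q, k in zip(query, keys[i])) / math.sqrt(d) for i in idx]
    weights = softmax(scores)
    out = [0.0] * len(values[0])
    for w, i in zip(weights, idx):
        for j, v in enumerate(values[i]):
            out[j] += w * v
    return out

def mode_masks(is_text, query_is_text):
    """Hypothetical attention modes (an assumption, not taken from the paper)."""
    return {
        "intra": [t == query_is_text for t in is_text],  # same modality only
        "inter": [t != query_is_text for t in is_text],  # other modality only
        "joint": [True] * len(is_text),                  # all tokens
    }

def switched_attention(query, query_is_text, tokens, is_text, mode_logits):
    """Mix the per-mode attention outputs with softmax weights over mode_logits."""
    masks = mode_masks(is_text, query_is_text)
    names = list(masks)
    probs = softmax([mode_logits[n] for n in names])
    out = [0.0] * len(query)
    for p, n in zip(probs, names):
        ctx = attend(query, tokens, tokens, masks[n])
        out = [o + p * c for o, c in zip(out, ctx)]
    return out, dict(zip(names, probs))

# Toy input: two text tokens followed by one image-region token.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
is_text = [True, True, False]
logits = {"intra": 0.0, "inter": 0.0, "joint": 0.0}  # would be learned in practice
context, mode_probs = switched_attention(tokens[0], True, tokens, is_text, logits)
```

In a trained model the logits would be learned per layer or per task, so different tasks could settle on different mixtures of modes. A hard, discrete choice of mode is an equally plausible reading of the abstract.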

Switch-BERT: Learning to Model Multimodal Interactions by Switching Attention and Input

We present VILLA, the first known effort on large-scale adversarial training
for vision-and-language (V+L) representation learning. VILLA consists of two
training stages: (i) task-agnostic adversarial pre-training; followed by (ii)
task-specific adversarial finetuning. Instead of adding adversarial
perturbations on image pixels and textual tokens, we propose to perform
adversarial training in the embedding space of each modality. To enable
large-scale training, we adopt the "free" adversarial training strategy, and
combine it with KL-divergence-based regularization to promote higher invariance
in the embedding space. We apply VILLA to current best-performing V+L models,
and achieve new state of the art on a wide range of tasks, including Visual
Question Answering, Visual Commonsense Reasoning, Image-Text Retrieval,
Referring Expression Comprehension, Visual Entailment, and NLVR2.

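This work proposes VILLA, a large-scale adversarial training method for vision-and-language representation learning. It performs adversarial training in the embedding space and achieves new state-of-the-art performance.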
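
The core idea of perturbing embeddings rather than pixels or tokens, with a KL term tying clean and perturbed predictions together, can be sketched on a toy model. This is only an illustration under stated assumptions, not VILLA's training procedure. It uses a hypothetical linear softmax classifier `W`, a single normalized gradient-ascent step of hypothetical radius `eps`, and a symmetric KL term weighted by a hypothetical `alpha`. The "free" adversarial training strategy named in the abstract is not implemented here.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def predict(W, e):
    """Class probabilities of a linear classifier applied to embedding e."""
    return softmax([sum(w * x for w, x in zip(row, e)) for row in W])

def cross_entropy(p, label):
    return -math.log(p[label])

def kl(p, q):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def grad_wrt_embedding(W, e, label):
    """Gradient of the cross-entropy loss w.r.t. the embedding: W^T (p - onehot)."""
    p = predict(W, e)
    diff = [pi - (1.0 if i == label else 0.0) for i, pi in enumerate(p)]
    return [sum(diff[i] * W[i][j] for i in range(len(W))) for j in range(len(e))]

def adversarial_loss(W, e, label, eps=0.5, alpha=1.0):
    """Clean loss + adversarial loss + symmetric KL between the two predictions.
    The perturbation is added to the embedding, not to the raw input."""
    g = grad_wrt_embedding(W, e, label)
    norm = math.sqrt(sum(x * x for x in g)) or 1.0
    delta = [eps * x / norm for x in g]  # one ascent step with L2 norm eps
    e_adv = [x + d for x, d in zip(e, delta)]
    p_clean, p_adv = predict(W, e), predict(W, e_adv)
    loss = (cross_entropy(p_clean, label)
            + cross_entropy(p_adv, label)
            + alpha * (kl(p_clean, p_adv) + kl(p_adv, p_clean)))
    return loss, delta

# Toy two-class example with a 2-dimensional embedding.
W = [[1.0, 0.0], [0.0, 1.0]]
embedding = [0.2, 0.1]
loss, delta = adversarial_loss(W, embedding, label=0)
```

In a real V+L model the perturbation would be applied separately to the image-region and text-token embeddings, and the loss would be minimized with respect to the model parameters while the perturbation is chosen to maximize it.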