Most humans use visual imagination to understand and reason about language,
but models such as BERT reason about language using knowledge acquired during
text-only pretraining. In this work, we investigate whether vision-and-language
pretraining can improve performance on text-only tasks that involve implicit
visual reasoning, focusing primarily on zero-shot probing methods. We propose a
suite of visual language understanding (VLU) tasks for probing the visual
reasoning abilities of text encoder models, as well as various non-visual
natural language understanding (NLU) tasks for comparison. We also contribute a
novel zero-shot knowledge probing method, Stroop probing, for applying models
such as CLIP to text-only tasks without needing a prediction head such as the
masked language modelling head of models like BERT. We show that SOTA
multimodally trained text encoders outperform unimodally trained text encoders
on the VLU tasks but underperform them on the NLU tasks, lending
new context to previously mixed results regarding the NLU capabilities of
multimodal models. We conclude that exposure to images during pretraining
affords inherent visual reasoning knowledge that is reflected in language-only
tasks that require implicit visual reasoning. Our findings bear importance in
the broader context of multimodal learning, providing principled guidelines for
the choice of text encoders used in such contexts.

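This study examines whether vision-and-language pretraining can improve model performance on text-only tasks that require implicit visual reasoning, proposes a suite of tasks for probing the visual reasoning abilities of text encoder models, and shows that multimodal pretraining can improve text encoder performance on these tasks.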

Is BERT Blind? Exploring the Effect of Vision-and-Language Pretraining on Visual Language Understanding

The interest in Artificial Intelligence (AI) and its applications has seen
unprecedented growth in the last few years. The success can be partly
attributed to the advancements of deep neural networks made in the sub-fields
of AI such as Computer Vision (CV) and Natural Language Processing (NLP). The
promising research area that this dissertation focuses on is visual and
language understanding, which involves many challenging tasks, e.g.,
classification, detection, segmentation, machine translation, and captioning.
State-of-the-art methods for solving these problems usually involve
only two parts, source data and target labels, which is rather insufficient,
especially when the dataset is small. Meanwhile, many external tools or sources
can provide extra useful information (external knowledge) that can help improve
the performance of these methods. For example, a detection model has been
applied to provide better object features than state-of-the-art ResNet for
image captioning models. Inspired by this observation, we developed a
methodology in which we first extract external knowledge and then integrate it
with the original models. The external knowledge can either be extracted from
the dataset or come directly from external sources, e.g., grammar rules or scene
graphs. We apply this methodology to different AI tasks, including machine
translation and image captioning, and improve the original state-of-the-art
models by a large margin.

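This work presents a method that uses external knowledge to improve performance on AI tasks, applies it to visual and language understanding tasks such as machine translation and image captioning, and significantly improves on the original models for these tasks.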

Exploring External Knowledge for Accurate Modeling of Visual and Language Problems

Large-scale pre-training has recently revolutionized vision-and-language (VL)
research. Models such as LXMERT and UNITER have significantly lifted the state
of the art over a wide range of VL tasks. However, the large number of
parameters in such models hinders their application in practice. In parallel,
work on the lottery ticket hypothesis (LTH) has shown that deep neural networks
contain small matching subnetworks that can achieve performance on par with, or
even better than, the dense networks when trained in isolation. In this work, we
perform the first empirical study to assess whether such trainable subnetworks
also exist in pre-trained VL models. We use UNITER as the main testbed (and also
test on LXMERT and ViLT), and consolidate 7 representative VL tasks for
experiments, including visual question answering, visual commonsense reasoning,
visual entailment, referring expression comprehension, image-text retrieval,
GQA, and NLVR$^2$. Through comprehensive analysis, we summarize our main
findings as follows. ($i$) It is difficult to find subnetworks that strictly
match the performance of the full model. However, we can find "relaxed" winning
tickets at 50%-70% sparsity that maintain 99% of the full accuracy. ($ii$)
Subnetworks found by task-specific pruning transfer reasonably well to the
other tasks, while those found on the pre-training tasks at 60%/70% sparsity
transfer universally, matching 98%/96% of the full accuracy on average over all
the tasks. ($iii$) Besides UNITER, other models such as LXMERT and ViLT can
also play lottery tickets. However, the highest sparsity we can achieve for
ViLT is far lower than that for LXMERT and UNITER (30% vs. 70%). ($iv$) LTH also remains
relevant when using other training methods (e.g., adversarial training).
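
To make the terms "sparsity" and "relaxed winning ticket" in the abstract above concrete, here is a minimal sketch in plain Python. It assumes one-shot unstructured magnitude pruning, which is a common choice in lottery-ticket work but is an assumption here: the abstract does not say how the subnetworks are found. The function names, the toy weights, and the accuracy values are all hypothetical and are not taken from the paper.

```python
def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask that prunes the `sparsity` fraction of
    smallest-magnitude weights (assumed pruning criterion)."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    n_prune = round(len(weights) * sparsity)
    # Indices in ascending order of magnitude; the first n_prune get pruned.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(weights))]


def is_relaxed_winning_ticket(subnetwork_acc, full_acc, retention=0.99):
    """True if the subnetwork keeps at least `retention` of the full
    model's accuracy (99% in the abstract's "relaxed" setting)."""
    return subnetwork_acc >= retention * full_acc


# Hypothetical toy weights, pruned to 70% sparsity.
weights = [0.8, -0.05, 0.3, -0.9, 0.01, 0.4, -0.2, 0.6, 0.02, -0.7]
mask = magnitude_mask(weights, sparsity=0.7)
sparse_weights = [w * m for w, m in zip(weights, mask)]
```

With these toy values, only the three largest-magnitude weights survive. In an actual lottery-ticket experiment the masked network would then be retrained and its accuracy compared against the dense model using a check like `is_relaxed_winning_ticket`.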

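Through an empirical study, this paper finds that large-scale pre-trained VL models contain trainable subnetworks; obtained through careful pruning, these subnetworks achieve high accuracy and transfer well across tasks.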

Playing Lottery Tickets with Vision and Language

We present ALFRED (Action Learning From Realistic Environments and
Directives), a benchmark for learning a mapping from natural language
instructions and egocentric vision to sequences of actions for household tasks.
ALFRED includes long, compositional tasks with non-reversible state changes to
shrink the gap between research benchmarks and real-world applications. ALFRED
consists of expert demonstrations in interactive visual environments for 25k
natural language directives. These directives contain both high-level goals
like "Rinse off a mug and place it in the coffee maker." and low-level language
instructions like "Walk to the coffee maker on the right." ALFRED tasks are
more complex in terms of sequence length, action space, and language than
existing vision-and-language task datasets. We show that a baseline model based
on recent embodied vision-and-language tasks performs poorly on ALFRED,
suggesting that there is significant room for developing innovative grounded
visual language understanding models with this benchmark.

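ALFRED is a benchmark for learning a mapping from natural language instructions and egocentric vision to sequences of actions for household tasks. It includes expert demonstrations in interactive visual environments for 25k natural language directives, and is more complex than existing vision-and-language task datasets in terms of sequence length, action space, and language.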