Human language is grounded on multimodal knowledge including visual knowledge
like colors, sizes, and shapes. However, current large-scale pre-trained
language models rely on text-only self-supervised training with massive text
data, which precludes them from utilizing relevant visual information when
necessary. To address this, we propose a novel pre-training framework, named
VaLM, to Visually-augment text tokens with retrieved relevant images for
Language Modeling. Specifically, VaLM builds on a novel latent text-image
alignment method via an image retrieval module to fetch corresponding images
given a textual context. With the visually-augmented context, VaLM uses a
visual knowledge fusion layer to enable multimodal grounded language modeling
by attending to both text context and visual knowledge in images. We evaluate
VaLM on various visual knowledge-intensive commonsense reasoning tasks, which
require visual information to excel. The experimental results illustrate that
VaLM outperforms all strong language-only and vision-language baselines with
substantial gains in reasoning object commonsense including color, size, and
shape. Our code is available at this https URL.

提出了一种名为 VaLM 的预训练框架，对语言建模进行视觉增强，通过图像检索模块检索相应图像，使用视觉知识融合层使多模态语言建模可以参考文本和图像的视觉知识，并在需要的情况下获取相关联的图片，通过对各种视觉知识密集型的常识推理任务的评估，展示了 VaLM 在推理对象的常识，包括颜色、大小和形状方面的性能优于强语言和视觉语言基线。