We propose the Vision-and-Augmented-Language Transformer (VAuLT). VAuLT is an
extension of the popular Vision-and-Language Transformer (ViLT), and improves
performance on vision-and-language (VL) tasks that involve more complex text
inputs than image captions while having minimal impact on training and
inference efficiency. ViLT enables efficient training and inference in VL
tasks by encoding images with a linear projection of patches instead of an
object detector. However, it is pretrained on captioning datasets, where the
language input is simple, literal, and descriptive, and therefore lacks
linguistic diversity. Multimedia data in the wild, such as multimodal social
media data, departs notably from captioning language and spans a wider
diversity of tasks. We indeed find evidence that the language capacity of ViLT
is lacking. The key insight and
novelty of VAuLT is to propagate the output representations of a large language
model (LM) like BERT to the language input of ViLT. We show that joint training
of the LM and ViLT can yield relative improvements of up to 20% over ViLT and
achieve state-of-the-art or comparable performance on VL tasks involving richer
language inputs and affective constructs, such as Target-Oriented Sentiment
Classification in TWITTER-2015 and TWITTER-2017, and Sentiment Classification
in MVSA-Single and MVSA-Multiple. Our code is available at
this https URL.
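
The data flow described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: `ToyLanguageModel`, `ToyVLModel`, `vault_forward`, and the toy sizes are hypothetical stand-ins for BERT and ViLT, and the sketch assumes the LM's hidden size matches the hidden size of the vision-and-language model (otherwise an extra projection would be needed). It only shows where the LM output enters: the per-token output vectors take the place of the text embeddings at the language input and are concatenated with the projected image patches.

```python
import random

HIDDEN = 8  # toy hidden size, assumed equal for the LM and the VL model


def make_matrix(rows, cols, rng):
    """Random rows x cols matrix as nested lists."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(vec, matrix):
    """Multiply a row vector (length = rows) by a rows x cols matrix."""
    return [sum(v * row[j] for v, row in zip(vec, matrix))
            for j in range(len(matrix[0]))]


class ToyLanguageModel:
    """Hypothetical stand-in for a pretrained LM: token ids -> one vector per token."""

    def __init__(self, vocab_size, rng):
        self.embed = make_matrix(vocab_size, HIDDEN, rng)
        self.mix = make_matrix(HIDDEN, HIDDEN, rng)

    def __call__(self, token_ids):
        vectors = [self.embed[t] for t in token_ids]
        # Crude "contextualisation": add a projection of the sequence mean to each token.
        mean = [sum(col) / len(vectors) for col in zip(*vectors)]
        context = matvec(mean, self.mix)
        return [[a + b for a, b in zip(v, context)] for v in vectors]


class ToyVLModel:
    """Hypothetical stand-in for the VL model: text vectors + projected patches."""

    def __init__(self, patch_dim, rng):
        self.patch_proj = make_matrix(patch_dim, HIDDEN, rng)

    def build_sequence(self, text_vectors, flat_patches):
        patch_vectors = [matvec(p, self.patch_proj) for p in flat_patches]
        # A real model would run transformer layers over this joint sequence.
        return text_vectors + patch_vectors


def vault_forward(lm, vl, token_ids, flat_patches):
    """LM output representations become the language input of the VL model."""
    text_vectors = lm(token_ids)
    return vl.build_sequence(text_vectors, flat_patches)


rng = random.Random(0)
lm = ToyLanguageModel(vocab_size=20, rng=rng)
vl = ToyVLModel(patch_dim=12, rng=rng)
tokens = [3, 7, 1, 15]
patches = [[rng.random() for _ in range(12)] for _ in range(6)]
sequence = vault_forward(lm, vl, tokens, patches)
print(len(sequence), len(sequence[0]))  # 4 text vectors + 6 patch vectors, each of size 8
```

In joint training, gradients would flow through both components, so the LM is fine-tuned together with the vision-and-language model rather than used as a frozen feature extractor.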

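This work proposes the Vision-and-Augmented-Language Transformer (VAuLT), whose core idea is to propagate the output representations of a large language model (LM) such as BERT to the language input of ViLT. On vision-and-language tasks involving richer language inputs and affective constructs, VAuLT achieves relative improvements of up to 20% over ViLT, and it reaches state-of-the-art or comparable performance on Target-Oriented Sentiment Classification in TWITTER-2015 and TWITTER-2017 and on Sentiment Classification in MVSA-Single and MVSA-Multiple.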

VAuLT: Augmenting the Vision-and-Language Transformer for Sentiment Classification on Social Media

Vision-and-Language Pre-training (VLP) has improved performance on various
joint vision-and-language downstream tasks. Current approaches to VLP heavily
rely on image feature extraction processes, most of which involve region
supervision (e.g., object detection) and the convolutional architecture (e.g.,
ResNet). Although this reliance is disregarded in the literature, we find it
problematic in terms of both (1) efficiency/speed, since simply extracting
input features requires much more computation than the multimodal interaction
steps; and (2) expressive power, since it is upper bounded by the expressive
power of the visual embedder and its predefined visual vocabulary. In this
paper, we present a
minimal VLP model, Vision-and-Language Transformer (ViLT), monolithic in the
sense that the processing of visual inputs is drastically simplified to the
same convolution-free manner in which we process textual inputs. We show that
ViLT is up to tens of times faster than previous VLP models, yet with
competitive or better downstream task performance. Our code and pre-trained
weights are available at this https URL.
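
As a rough illustration of what convolution-free visual input processing can look like, the sketch below splits an image into non-overlapping patches, flattens each one, and applies a single shared linear projection. It is a simplified, dependency-free example rather than the released ViLT code: `extract_patches`, `linear_project`, the 4 x 6 toy image, the patch size of 2, and the hidden size of 5 are all assumptions chosen for readability, and position embeddings are omitted.

```python
import random


def extract_patches(image, patch):
    """Split an H x W x C nested-list image into flattened patch x patch tiles."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0, "image must tile evenly"
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = []
            for r in range(top, top + patch):
                for c in range(left, left + patch):
                    flat.extend(image[r][c])  # append every channel value of the pixel
            patches.append(flat)
    return patches


def linear_project(flat_patches, weight, bias):
    """Apply one shared linear map: embedding = flat_patch @ weight + bias."""
    embeddings = []
    for flat in flat_patches:
        row = [
            sum(x * weight[i][j] for i, x in enumerate(flat)) + bias[j]
            for j in range(len(bias))
        ]
        embeddings.append(row)
    return embeddings


# Toy 4 x 6 RGB image whose pixel at (r, c) is [r, c, 0], so patches are easy to inspect.
image = [[[r, c, 0] for c in range(6)] for r in range(4)]
flat_patches = extract_patches(image, patch=2)

rng = random.Random(0)
patch_dim, hidden = len(flat_patches[0]), 5  # hidden size is an arbitrary toy value
weight = [[rng.uniform(-0.1, 0.1) for _ in range(hidden)] for _ in range(patch_dim)]
bias = [0.0] * hidden

patch_embeddings = linear_project(flat_patches, weight, bias)
print(len(flat_patches), patch_dim, len(patch_embeddings[0]))  # 6 12 5
```

No object detector or convolutional backbone is involved: the only visual computation before the transformer is one matrix multiplication per patch, which is why this step is cheap compared with region-based feature extraction.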

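This paper presents ViLT, a minimal Vision-and-Language Pre-training model. ViLT is monolithic: visual inputs are processed in the same convolution-free manner as textual inputs. By simplifying image input processing, ViLT runs much faster than previous VLP models while achieving competitive or better downstream task performance.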