In this work, we present the Textless Vision-Language Transformer (TVLT),
where homogeneous transformer blocks take raw visual and audio inputs for
vision-and-language representation learning with minimal modality-specific
design, and do not use text-specific modules such as tokenization or automatic
speech recognition (ASR). TVLT is trained by reconstructing masked patches of
continuous video frames and audio spectrograms (masked autoencoding) and
by contrastive modeling to align video and audio. TVLT attains performance
comparable to its text-based counterpart on various multimodal tasks, such as
visual question answering, image retrieval, video retrieval, and multimodal
sentiment analysis, with 28x faster inference speed and only 1/3 of the
parameters. Our findings suggest the possibility of learning compact and
efficient visual-linguistic representations from low-level visual and audio
signals without assuming the prior existence of text. Our code and checkpoints
are available at: this https URL
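
The two training signals named in the abstract can be illustrated with a
minimal NumPy sketch. It is not the authors' implementation: the function
names, the 0.75 mask ratio, the temperature, and the use of a symmetric
InfoNCE loss as the contrastive term are all assumptions made for illustration.

```python
import numpy as np

def random_mask(num_patches, mask_ratio, rng):
    """Return a boolean array in which True marks a masked patch."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask

def masked_reconstruction_loss(patches, reconstruction, mask):
    """Mean squared error computed over the masked patches only."""
    diff = reconstruction[mask] - patches[mask]
    return float(np.mean(diff ** 2))

def contrastive_loss(video_emb, audio_emb, temperature=0.07):
    """Symmetric InfoNCE over paired video/audio embeddings.

    Assumed stand-in for the contrastive objective; row i of each
    array is treated as a matching video/audio pair.
    """
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    logits = v @ a.T / temperature

    def cross_entropy_on_diagonal(scores):
        # Numerically stable log-softmax per row, then pick the paired entry.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    return float(0.5 * (cross_entropy_on_diagonal(logits)
                        + cross_entropy_on_diagonal(logits.T)))
```

In this sketch the reconstruction loss ignores visible patches, and the
contrastive loss is low only when each video embedding is closest to its own
audio embedding.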

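This work proposes the Textless Vision-Language Transformer (TVLT), a
vision-and-language model that needs no text-specific modules. It uses
homogeneous transformer blocks to extract multimodal information from visual
and audio inputs, and it is trained with masked autoencoding and with
contrastive modeling that aligns video and audio. On tasks including visual
question answering, image retrieval, video retrieval, and multimodal sentiment
analysis, it performs comparably to models that rely on text modules. The
results suggest that compact, efficient visual-linguistic representations can
be learned from low-level visual and audio signals.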

TVLT: Textless Vision-Language Transformer

Recently, multimodal transformer models have gained popularity because their
performance on language and vision tasks suggests they learn rich
visual-linguistic representations. Focusing on zero-shot image retrieval tasks,
we study three important factors which can impact the quality of learned
representations: pretraining data, the attention mechanism, and loss functions.
By pretraining models on six datasets, we observe that dataset noise and
language similarity to our downstream task are important indicators of model
performance. Through architectural analysis, we learn that models with a
multimodal attention mechanism can outperform deeper models with
modality-specific attention mechanisms. Finally, we show that successful
contrastive losses used in the self-supervised learning literature do not
yield similar performance gains when used in multimodal transformers.
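
Zero-shot image retrieval is commonly scored with recall@K. The abstract does
not name its metric, so the sketch below is an assumption for illustration.
The function name and the convention that caption i matches image i are
hypothetical.

```python
import numpy as np

def recall_at_k(similarity, k):
    """Fraction of captions whose correct image ranks in the top k.

    similarity[i, j] is the model score of caption i against image j;
    the ground-truth image for caption i is assumed to be image i.
    """
    correct = np.diag(similarity)[:, None]
    # Rank of the correct image = number of images scoring strictly higher.
    ranks = (similarity > correct).sum(axis=1)
    return float(np.mean(ranks < k))
```

Recall@K does not decrease as K grows, so reporting it at several values of K
(for example 1, 5, and 10) shows how quickly the correct image surfaces.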

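This paper argues that the performance of multimodal transformer models on
language and vision tasks shows they can learn rich visual-linguistic
representations. It focuses on zero-shot image retrieval and studies three
important factors (pretraining data, the attention mechanism, and loss
functions) to assess their effect on model performance.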