Although speech is a simple and effective way for humans to communicate with
the outside world, realistic speech interaction carries multimodal
information, e.g., vision and text. How to design a unified framework that
integrates different modalities and leverages different resources (e.g.,
visual-audio pairs, audio-text pairs, unlabeled speech, and unlabeled text) to
facilitate speech representation learning has not been well explored. In this paper,
we propose a unified cross-modal representation learning framework, VATLM
(Visual-Audio-Text Language Model). The proposed VATLM employs a unified
backbone network to model the modality-independent information and utilizes
three simple modality-dependent modules to preprocess visual, speech, and text
inputs. To integrate these three modalities into one shared semantic
space, VATLM is optimized with a masked prediction task over unified tokens
produced by our proposed unified tokenizer. We evaluate the pre-trained VATLM on
audio-visual downstream tasks, including audio-visual speech
recognition (AVSR) and visual speech recognition (VSR). Results show that
the proposed VATLM outperforms previous state-of-the-art models, such as
the audio-visual pre-trained AV-HuBERT model, and analysis also demonstrates that
VATLM is capable of aligning different modalities into the same space. To
facilitate future research, we release the code and pre-trained models at
this https URL.

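This paper adopts a unified cross-modal representation learning framework, VATLM. It models modality-independent information with a shared backbone, preprocesses visual, speech, and text inputs with modality-dependent modules, and integrates the three modalities into one shared semantic space through a masked prediction task over tokens from a unified tokenizer. Results on downstream tasks show that VATLM outperforms previous state-of-the-art models on audio-visual tasks and can align different modalities into the same semantic space.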
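
The structure described above (modality-dependent input modules, one shared backbone, and a masked prediction objective over unified tokens) can be illustrated with a toy sketch. This is not the released VATLM implementation: every name, dimension, and the mean-pooling context step are assumptions made for illustration, the backbone is a single random linear layer rather than a Transformer, and the unified-token targets are random integers rather than the output of the proposed tokenizer.

```python
import math
import random

random.seed(0)

DIM = 8      # shared hidden size (hypothetical)
VOCAB = 5    # size of the unified token vocabulary (hypothetical)
FEATURE_DIMS = {"visual": 12, "speech": 6, "text": 4}  # hypothetical input sizes


def linear(in_dim, out_dim):
    """Random weight matrix stored as out_dim rows of in_dim values."""
    return [[random.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def apply(weights, vec):
    """Matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


# Modality-dependent modules: one projection per modality into the shared space.
FRONTENDS = {name: linear(d, DIM) for name, d in FEATURE_DIMS.items()}
# Modality-independent parts, shared by all inputs.
BACKBONE = linear(DIM, DIM)   # toy stand-in for the shared backbone
HEAD = linear(DIM, VOCAB)     # scores each unified token
MASK_EMBEDDING = [0.0] * DIM  # replaces the embedding at masked positions


def encode(modality, frames, mask):
    """Embed frames, mask some positions, and mix in sequence context."""
    embedded = [
        MASK_EMBEDDING if is_masked else apply(FRONTENDS[modality], frame)
        for frame, is_masked in zip(frames, mask)
    ]
    # Mean over time is a crude substitute for attention-based context mixing.
    context = [sum(col) / len(embedded) for col in zip(*embedded)]
    return [
        [math.tanh(v) for v in apply(BACKBONE, [e + c for e, c in zip(emb, context)])]
        for emb in embedded
    ]


def masked_prediction_loss(modality, frames, targets, mask):
    """Mean cross-entropy against unified tokens, over masked positions only."""
    losses = []
    for hidden, target, is_masked in zip(encode(modality, frames, mask), targets, mask):
        if is_masked:
            logits = apply(HEAD, hidden)
            log_z = math.log(sum(math.exp(v) for v in logits))
            losses.append(log_z - logits[target])
    return sum(losses) / len(losses) if losses else 0.0


def random_batch(modality, length=6):
    """Generate toy frames, unified-token targets, and a mask pattern."""
    frames = [
        [random.gauss(0.0, 1.0) for _ in range(FEATURE_DIMS[modality])]
        for _ in range(length)
    ]
    targets = [random.randrange(VOCAB) for _ in range(length)]
    mask = [i % 2 == 0 for i in range(length)]  # mask every other position
    return frames, targets, mask


for name in FEATURE_DIMS:
    frames, targets, mask = random_batch(name)
    print(name, round(masked_prediction_loss(name, frames, targets, mask), 4))
```

The point of the sketch is that only `FRONTENDS` differs across modalities. The backbone, the prediction head, and the target vocabulary are shared, so all three input types are trained toward the same token space.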