Audio-visual automatic speech recognition (AV-ASR) extends speech recognition
by introducing the video modality as an additional source of information. In
this work, the information contained in the motion of the speaker's mouth is
used to augment the audio features. The video modality is traditionally
processed with a 3D convolutional neural network (e.g., a 3D version of VGG).
Recently, image transformer networks (arXiv:2010.11929) demonstrated the
ability to extract rich visual features for image classification tasks. Here, we
propose to replace the 3D convolution with a video transformer to extract
visual features. We train our baselines and the proposed model on a large-scale
corpus of YouTube videos. The performance of our approach is evaluated on a
labeled subset of YouTube videos as well as on the LRS3-TED public corpus. Our
best video-only model obtains 31.4% WER on YTDEV18 and 17.0% on LRS3-TED,
relative improvements of 10% and 15%, respectively, over our convolutional
baseline. After fine-tuning our model, we achieve state-of-the-art audio-visual
recognition performance on LRS3-TED (1.6% WER). In addition, in a series of experiments
on multi-person AV-ASR, we obtained an average relative WER reduction of 2% over
our convolutional video frontend.

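This paper proposes replacing the 3D convolution with a video transformer for visual feature extraction, improving the performance of audio-visual automatic speech recognition, and evaluates the approach on a large-scale corpus of YouTube videos as well as on the LRS3-TED public corpus. The experimental results show that the method achieves state-of-the-art performance on LRS3-TED. In addition, on multi-person audio-visual automatic speech recognition, the method obtains an average relative reduction of 2% compared with the 3D convolutional frontend.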