This paper presents an audio-visual automatic speech recognition (AV-ASR)
system using a Transformer-based architecture. We focus in particular on the
scene context provided by the visual information to ground the ASR. We extract
representations for audio features in the encoder layers of the Transformer and
fuse video features using an additional crossmodal multihead attention layer.
Additionally, we incorporate a multitask training criterion for multiresolution
ASR, where we train the model to generate both character and subword level
transcriptions.
Experimental results on the How2 dataset indicate that multiresolution
training can speed up convergence by around 50% and yields a relative word
error rate (WER) improvement of up to 18% over subword prediction models.
Further, incorporating visual information improves performance, with relative
gains of up to 3.76% over audio-only models.
Our results are comparable to state-of-the-art Listen, Attend and Spell-based
architectures.
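
As a rough illustration of the components named above, the following is a minimal NumPy sketch, not the implementation used in this work. It assumes that audio encoder states act as queries and video features as keys and values, that both are already projected to a common width `d_model`, that the fused output is added back through a residual connection, and that the character and subword losses are combined by a simple weighted sum. All function names, weights, and numeric values here are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def crossmodal_attention(audio, video, w_q, w_k, w_v, w_o, num_heads):
    """Multihead attention with audio as queries, video as keys/values.

    audio: (T_a, d_model), video: (T_v, d_model).
    Returns the fused states (T_a, d_model) and the attention
    weights (num_heads, T_a, T_v).
    """
    t_a, d_model = audio.shape
    d_head = d_model // num_heads

    def split(x):
        # (T, d_model) -> (num_heads, T, d_head)
        return x.reshape(x.shape[0], num_heads, d_head).transpose(1, 0, 2)

    q = split(audio @ w_q)
    k = split(video @ w_k)
    v = split(video @ w_v)

    # Scaled dot-product scores between every audio and video position.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    context = weights @ v  # (num_heads, T_a, d_head)

    # Merge the heads back into one vector per audio position.
    merged = context.transpose(1, 0, 2).reshape(t_a, d_model)
    # Residual connection (an assumption of this sketch).
    return audio + merged @ w_o, weights


def multitask_loss(char_loss, subword_loss, char_weight=0.5):
    """Weighted sum of the two losses; the weighting is an assumption."""
    return char_weight * char_loss + (1.0 - char_weight) * subword_loss


def relative_improvement(baseline_wer, new_wer):
    """Relative WER reduction in percent with respect to the baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Toy usage with random inputs (all sizes are hypothetical).
rng = np.random.default_rng(0)
d_model, num_heads = 8, 2
audio = rng.standard_normal((5, d_model))   # 5 audio frames
video = rng.standard_normal((3, d_model))   # 3 video feature vectors
w_q, w_k, w_v, w_o = (
    rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4)
)
fused, weights = crossmodal_attention(
    audio, video, w_q, w_k, w_v, w_o, num_heads
)
```

Here `relative_improvement` spells out what a relative gain means: with hypothetical WERs of 20.0 for a baseline and 16.4 for an improved model, the relative improvement is 18%, whereas the absolute difference is only 3.6 points.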
