Audiovisual automatic speech recognition (AV-ASR) aims to improve the
robustness of a speech recognition system by incorporating visual information.
Training fully supervised multimodal models for this task from scratch, however,
is limited by the need for large labelled audiovisual datasets (in each
downstream domain of interest). We present AVFormer, a simple method for
augmenting audio-only models with visual information while simultaneously
performing lightweight domain adaptation. Specifically, (i) we inject visual
embeddings into a frozen ASR model using lightweight trainable adaptors, and
show that these can be trained on a small amount of weakly labelled video data
with minimal additional training time and parameters; (ii) we introduce a
simple curriculum scheme during training, which we show is crucial for enabling
the model to jointly process audio and visual information effectively; and
(iii) we show that our model achieves state-of-the-art zero-shot results on
three different AV-ASR benchmarks (How2, VisSpeech and Ego4D), while also
crucially preserving decent performance on traditional audio-only speech
recognition benchmarks (LibriSpeech). Qualitative results show that our model
effectively leverages visual information for robust speech recognition.

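AVFormer is a simple method that injects visual embeddings into a frozen speech recognition model using lightweight trainable adaptors and introduces a training scheme. It is trained on a small amount of weakly labelled video data. Experiments show that the method achieves state-of-the-art zero-shot results on three different AV-ASR benchmarks (How2, VisSpeech and Ego4D), while also performing well on a traditional audio-only speech recognition benchmark (LibriSpeech).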
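
To make the idea of lightweight adaptors around a frozen model more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a residual bottleneck adaptor with a zero-initialised up-projection (so the frozen model's behaviour is unchanged at the start of training) and assumes visual embeddings are prepended to the audio token sequence; neither detail is stated in the abstract. The names `BottleneckAdapter` and `inject_visual_tokens` are hypothetical.

```python
import random


def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


class BottleneckAdapter:
    """Hypothetical residual adaptor: x + W_up * relu(W_down * x).

    Only these two small matrices would be trained; the ASR model's own
    weights would stay frozen.
    """

    def __init__(self, dim, bottleneck, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        self.bottleneck = bottleneck
        # Down-projection: small random values (assumed initialisation).
        self.w_down = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                       for _ in range(bottleneck)]
        # Up-projection: zeros, so the adaptor is the identity at the start.
        self.w_up = [[0.0] * bottleneck for _ in range(dim)]

    def __call__(self, x):
        hidden = [max(0.0, h) for h in matvec(self.w_down, x)]  # ReLU
        delta = matvec(self.w_up, hidden)
        return [xi + di for xi, di in zip(x, delta)]  # residual connection

    def num_trainable_params(self):
        return 2 * self.dim * self.bottleneck


def inject_visual_tokens(audio_tokens, visual_tokens):
    """Prepend visual tokens to the audio sequence (assumed fusion scheme)."""
    return list(visual_tokens) + list(audio_tokens)
```

With a model width of 8 and a bottleneck of 2, this adaptor adds only 32 trainable parameters per layer, which illustrates why such modules are cheap compared with fine-tuning the full model.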