In this study, we address the problem of leveraging visual signals to
improve Automatic Speech Recognition (ASR), a task known as visual context-aware
ASR (VC-ASR). We explore novel VC-ASR approaches to leverage video and text
representations extracted by a self-supervised pre-trained text-video embedding
model. First, we propose a multi-stream attention architecture to leverage
signals from both audio and video modalities. This architecture consists of
separate encoders for the two modalities and a single decoder that attends over
them. We show that this architecture outperforms fusing the modalities at the
signal level. We also explore leveraging the visual information in a
second-pass model, also referred to as a `deliberation
model'. The deliberation model accepts audio representations and text
hypotheses from the first-pass ASR and combines them with a visual stream for
improved visual context-aware recognition. The proposed deliberation scheme
can work on top of any well-trained ASR system and also enables us to leverage
the pre-trained text model to ground the hypotheses in the visual features. Our
experiments on the HOW2 dataset show that both the multi-stream and deliberation
architectures are effective at the VC-ASR task. We evaluate the proposed
models in two scenarios: a clean audio stream, and distorted audio in which
specific words are masked out. The deliberation model outperforms
the multi-stream model and achieves relative word error rate (WER) improvements
of 6% and 8.7% for the clean and masked data, respectively, compared to an
audio-only model. The deliberation model also improves the recovery of masked
words by 59% relative.
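
The relative improvements quoted above are ratios with respect to the
audio-only baseline, not absolute differences in WER. The following is a
minimal sketch of how such a figure can be computed, assuming the common
definition of WER as word-level edit distance divided by the number of
reference words. The function names and all numeric inputs are hypothetical
and chosen for illustration; they are not taken from the paper or its code.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(references, hypotheses):
    """Corpus-level WER in percent: total word errors / total reference words."""
    errors = sum(edit_distance(r.split(), h.split())
                 for r, h in zip(references, hypotheses))
    words = sum(len(r.split()) for r in references)
    return 100.0 * errors / words


def relative_improvement(baseline_wer, system_wer):
    """Relative WER reduction in percent, measured against the baseline."""
    return 100.0 * (baseline_wer - system_wer) / baseline_wer


if __name__ == "__main__":
    # Toy transcript pair (hypothetical, not from HOW2).
    refs = ["pour the flour into the bowl"]
    audio_only = ["pour the flower into the bowl"]
    print(wer(refs, audio_only))

    # Hypothetical WER values, used only to illustrate the arithmetic.
    print(relative_improvement(20.0, 18.8))
```

Under this definition, a hypothetical baseline WER of 20.0% reduced to 18.8%
is a 1.2-point absolute gain but a 6% relative improvement, which is the sense
in which the figures above should be read.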

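This study addresses the problem of using visual signals to improve automatic
speech recognition (ASR). It explores visual context-aware ASR approaches based
on a self-supervised pre-trained text-video embedding model, comprising a
multi-stream attention architecture and a deliberation model. The deliberation
model, which leverages visual information, achieves higher recognition accuracy
than the multi-stream model under interfering noise and recovers masked words
more accurately.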