This paper addresses Audio-Visual Speech Recognition (AVSR) under multimodal
input corruption, where both the audio and the visual inputs are corrupted,
a setting that previous research has not addressed well.
Previous studies have focused on complementing corrupted audio inputs with
clean visual inputs, assuming that clean visual inputs are available.
In real-world settings, however, clean visual inputs are not always
available, and the visual stream can itself be corrupted by occluded lip
regions or noise. Thus,
we first show that previous AVSR models are, in fact, not robust to
corruption of both the audio and the visual input streams when compared
with uni-modal models. Then, we design multimodal input corruption
modeling to develop robust AVSR models. Lastly, we propose a novel AVSR
framework, the Audio-Visual Reliability Scoring module (AV-RelScore), that
is robust to corrupted multimodal inputs. AV-RelScore determines which
input modality stream is reliable for the prediction and exploits the
more reliable streams when predicting. The effectiveness of the
proposed method is evaluated through comprehensive experiments on two popular
benchmark databases, LRS2 and LRS3. We also show that the reliability scores
obtained by AV-RelScore reflect the degree of corruption well and lead the
proposed model to focus on the reliable multimodal representations.
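
The idea of weighting each stream by a reliability score before prediction can be illustrated with a minimal sketch. This is not the AV-RelScore implementation: the function names `apply_reliability` and `fuse`, the per-frame scalar scores in [0, 1], and the fusion by concatenation are all assumptions made for illustration, and the scores are passed in by hand rather than predicted by a learned module.

```python
from typing import List, Sequence

Frame = List[float]


def apply_reliability(features: Sequence[Sequence[float]],
                      scores: Sequence[float]) -> List[Frame]:
    """Scale each frame's feature vector by its reliability score.

    Hypothetical helper: a score of 1.0 keeps the frame unchanged and a
    score of 0.0 suppresses it entirely.
    """
    if len(features) != len(scores):
        raise ValueError("need exactly one score per frame")
    weighted = []
    for frame, score in zip(features, scores):
        if not 0.0 <= score <= 1.0:
            raise ValueError("reliability scores must lie in [0, 1]")
        weighted.append([score * value for value in frame])
    return weighted


def fuse(audio: Sequence[Sequence[float]],
         visual: Sequence[Sequence[float]],
         audio_scores: Sequence[float],
         visual_scores: Sequence[float]) -> List[Frame]:
    """Concatenate the reliability-weighted audio and visual frames.

    Assumes both streams are already aligned to the same number of frames.
    """
    if len(audio) != len(visual):
        raise ValueError("audio and visual streams must have equal length")
    weighted_audio = apply_reliability(audio, audio_scores)
    weighted_visual = apply_reliability(visual, visual_scores)
    return [a + v for a, v in zip(weighted_audio, weighted_visual)]


# Two time steps: the audio is judged unreliable at the second step,
# the visual stream is judged half reliable at the first step.
audio = [[1.0, 2.0], [3.0, 4.0]]
visual = [[10.0], [20.0]]
fused = fuse(audio, visual, audio_scores=[1.0, 0.0], visual_scores=[0.5, 1.0])
print(fused)
```

In this toy example the second fused frame carries only visual information, because the audio score at that step is zero. A downstream predictor would then rely on whichever stream the scores mark as more reliable at each time step.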

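This paper studies audio-visual speech recognition when the audio and the visual inputs are both corrupted. It analyzes the shortcomings of existing models and introduces multimodal input corruption modeling to design a robust AVSR framework, AV-RelScore, which uses reliability scores to identify the reliable input streams and improve recognition accuracy.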