This paper is the first to explore sentence-level multilingual Visual Speech
Recognition (VSR) with a single model. Because massive multilingual modeling of
visual data incurs a huge computational cost, we propose a novel strategy:
processing with visual speech units. Motivated by the recent success of audio
speech units, the proposed visual speech units are obtained by discretizing
the visual speech features extracted from a self-supervised visual speech
model. To correctly capture multilingual visual speech, we first train the
self-supervised visual speech model on 5,512 hours of multilingual audio-visual
data. Through analysis, we verify that the visual speech units mainly contain
viseme information while suppressing non-linguistic information. By using the
visual speech units as the inputs of our system, we pre-train the model to
predict corresponding text outputs on massive multilingual data constructed by
merging several VSR databases. Because both the inputs and outputs are discrete,
training is far more efficient than standard VSR training. Specifically, the
input data size is reduced to 0.016% of the
original video inputs. To compensate for the limited information that visual
input alone provides for speech recognition, we apply curriculum learning in
which the inputs of the system begin as audio-visual speech units and gradually
change to visual speech units. After pre-training, the model is fine-tuned on
continuous features. We set a new state of the art in multilingual VSR,
achieving performance comparable to previous language-specific VSR models with
a single trained model.
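
The two mechanisms described above, discretizing continuous features into units and gradually shifting the input from audio-visual units to visual units, can be illustrated with a minimal sketch. The abstract does not specify the discretization method or the curriculum schedule, so the nearest-centroid assignment and the linear decay below are assumptions made for illustration, and all names (`quantize`, `av_probability`, `pick_input`) are hypothetical.

```python
import random


def quantize(features, centroids):
    """Map each frame-level feature vector to the index of its nearest centroid.

    Assumption: units are nearest-centroid indices under squared Euclidean
    distance; the abstract only says the features are "discretized".
    """
    units = []
    for frame in features:
        dists = [sum((a - b) ** 2 for a, b in zip(frame, c)) for c in centroids]
        units.append(dists.index(min(dists)))
    return units


def av_probability(step, total_steps):
    """Probability of feeding audio-visual units at a given training step.

    Assumption: a linear decay from 1.0 to 0.0; the abstract only says the
    inputs "gradually change" to visual speech units.
    """
    if total_steps <= 0:
        return 0.0
    return max(0.0, 1.0 - step / total_steps)


def pick_input(av_units, v_units, step, total_steps, rng):
    """Choose audio-visual or visual-only units for one training example."""
    if rng.random() < av_probability(step, total_steps):
        return av_units
    return v_units


# Toy example: three 2-D "frames" and a two-entry codebook.
features = [[0.1, 0.0], [0.9, 1.0], [0.2, 0.1]]
centroids = [[0.0, 0.0], [1.0, 1.0]]
visual_units = quantize(features, centroids)

rng = random.Random(0)
early = pick_input(["av"], ["v"], step=0, total_steps=10, rng=rng)
late = pick_input(["av"], ["v"], step=10, total_steps=10, rng=rng)
```

In this sketch each frame is reduced to a single integer index, which is the source of the efficiency gain the abstract reports; the actual codebook size, feature dimension, and schedule used by the authors are not stated here.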

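This study explores sentence-level multilingual visual speech recognition with a single model. It takes discretized visual speech units as input, obtained from a self-supervised visual speech model trained on 5,512 hours of multilingual audio-visual data, and applies curriculum learning to compensate for the limited visual information available for speech recognition, achieving performance comparable to previous language-specific visual speech recognition models.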