Audio-Visual Speech Recognition (AVSR) uses lip-based video to improve
performance in noise. Since videos are harder to obtain than audio, the video
training data of AVSR models is usually limited to a few thousand hours. In
contrast, speech models such as Whisper are trained with hundreds of thousands
of hours of data, and thus learn a better speech-to-text decoder. This large
gap in training data motivates us to adapt Whisper to handle video inputs.
Inspired by Flamingo, which injects visual features into language models, we
propose Whisper-Flamingo, which integrates visual features into the Whisper
speech recognition and translation model with gated cross attention. Our
audio-visual Whisper-Flamingo outperforms audio-only Whisper on English speech
recognition and En-X translation for 6 languages in noisy conditions. Moreover,
Whisper-Flamingo is a versatile model that performs all of these tasks using one
set of parameters, while prior methods are trained separately on each language.
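
The gated cross attention mentioned above can be illustrated with a minimal sketch. It assumes the tanh gating described for Flamingo, where a learnable scalar (called alpha here) starts at zero so that the pretrained model's behaviour is unchanged at the start of training. The abstract does not give Whisper-Flamingo's exact layer layout, so the function names, the single attention head, and the omission of learned projections, layer normalization, and feed-forward sublayers are all simplifying assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention (no learned projections).

    queries: decoder states, one vector per text token.
    keys, values: visual features, one vector per video frame.
    """
    scale = math.sqrt(len(queries[0]))
    dim = len(values[0])
    attended = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors for this query.
        attended.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return attended


def gated_cross_attention(decoder_states, visual_features, alpha):
    """Residual update x + tanh(alpha) * CrossAttn(x, visual).

    alpha is a hypothetical name for the learnable gate; with alpha = 0
    the visual branch contributes nothing and the input passes through.
    """
    gate = math.tanh(alpha)
    attended = cross_attention(decoder_states, visual_features, visual_features)
    return [
        [x + gate * a for x, a in zip(state, att)]
        for state, att in zip(decoder_states, attended)
    ]


# Toy inputs: two text-token states attending over three video-frame features.
text_states = [[1.0, 0.0], [0.0, 1.0]]
video_feats = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]

closed = gated_cross_attention(text_states, video_feats, alpha=0.0)
opened = gated_cross_attention(text_states, video_feats, alpha=1.0)
```

With the gate closed (alpha = 0) the output equals the input decoder states, which is why this kind of layer can be added to a pretrained model without disturbing it; as alpha moves away from zero, visual information is mixed in.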

Whisper-Flamingo integrates visual features into Whisper with gated cross attention, improving speech recognition and translation in noisy conditions across multiple languages.

Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation

This work presents an extensive and detailed study on Audio-Visual Speech
Recognition (AVSR) for five widely spoken languages: Chinese, Spanish, English,
Arabic, and French. We collected large-scale datasets for each language except
English and trained supervised learning models. Our model, ViSpeR, is trained
in a multilingual setting and achieves competitive performance on newly
established benchmarks for each language. The datasets and models are released
to the community as a foundation for further research on Audio-Visual Speech
Recognition, an increasingly important area of research. Code is available at
https://github.com/YasserdahouML/visper.

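This work presents an extensive and detailed study of audio-visual speech recognition (AVSR) in five widely spoken languages: Chinese, Spanish, English, Arabic, and French. Using newly collected large-scale datasets and supervised training, the ViSpeR model, trained in a multilingual setting, achieves competitive performance on newly established benchmarks for each language. The datasets and models are released to the research community to lay a foundation for further work on AVSR.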

ViSpeR: Multilingual Audio-Visual Speech Recognition

Audio-visual speech recognition (AVSR) is a multimodal extension of automatic
speech recognition (ASR), using video as a complement to audio. In AVSR,
considerable effort has been directed at datasets for facial features such as
lip reading, but these often fall short in evaluating image comprehension
capabilities in broader contexts. In this paper, we construct SlideAVSR, an
AVSR dataset built from scientific paper explanation videos. SlideAVSR provides
a new benchmark in which models transcribe speech utterances using the text on
slides in the presentation recordings. Because the technical terms that are
frequent in paper explanations are notoriously challenging to transcribe
without reference texts, our SlideAVSR dataset spotlights a new aspect of AVSR
problems. As a simple yet effective baseline, we propose DocWhisper, an AVSR
model that can refer to textual information from slides, and confirm its
effectiveness on SlideAVSR.
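
The abstract does not say how DocWhisper consumes slide text. One plausible route, assumed here purely for illustration, is to pass words recognized on the slide to the speech model as a short text prompt. The sketch below shows only the prompt-building step; the helper name build_slide_prompt, the max_words budget, and the case-insensitive deduplication are hypothetical choices, not the authors' method.

```python
def build_slide_prompt(ocr_words, max_words=50):
    """Join unique slide words into one prompt string (hypothetical helper).

    ocr_words: words recognized on a slide, in reading order.
    max_words: assumed budget on how many words the prompt may carry.
    """
    seen = set()
    kept = []
    for word in ocr_words:
        if len(kept) >= max_words:
            break
        key = word.lower()
        if key in seen:
            # Skip repeats so the budget is spent on distinct terms.
            continue
        seen.add(key)
        kept.append(word)
    return " ".join(kept)


slide_words = ["Transformer", "BLEU", "transformer", "Whisper", "WER"]
prompt = build_slide_prompt(slide_words, max_words=3)
```

Keeping the first occurrence of each term and capping the length is one simple way to fit slide text into a limited prompt; which words to keep when a slide is dense is a separate design question that this sketch does not address.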

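This paper constructs SlideAVSR, an AVSR dataset built from scientific paper explanation videos, which provides a benchmark where models transcribe speech using the text on slides in presentation recordings. It also introduces DocWhisper, a simple yet effective AVSR model that can refer to textual information from slides, and confirms its effectiveness on SlideAVSR.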