Visual cues, like lip motion, have been shown to improve the performance of
Automatic Speech Recognition (ASR) systems in noisy environments. We propose
LipGER (Lip Motion aided Generative Error Correction), a novel framework for
leveraging visual cues for noise-robust ASR. Instead of learning the
cross-modal correlation between the audio and visual modalities, we make an LLM
learn the task of visually-conditioned (generative) ASR error correction.
Specifically, we instruct an LLM to predict the transcription from the N-best
hypotheses generated by ASR beam search, with the prediction further conditioned
on lip motion. This approach addresses key challenges in traditional AVSR learning,
such as the lack of large-scale paired datasets and difficulties in adapting to
new domains. We experiment on four datasets in various settings and show that
LipGER improves the Word Error Rate by 1.1% to 49.2%. We also release
LipHyp, a large-scale dataset of hypothesis-transcription pairs that is
additionally equipped with lip motion cues, to promote further research in this
space.
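
The text-only part of this pipeline can be illustrated with a minimal sketch: an N-best list is formatted into an instruction for an LLM, and a candidate transcription is scored with Word Error Rate. The prompt template, the function names, and the example sentences are hypothetical and are not taken from LipGER. The lip-motion conditioning is left out, because the abstract does not describe how visual features are passed to the LLM. Word Error Rate is computed here in its standard form, as word-level edit distance divided by the number of reference words.

```python
from typing import List


def build_correction_prompt(hypotheses: List[str]) -> str:
    """Format an N-best list as an LLM instruction (hypothetical template)."""
    if not hypotheses:
        raise ValueError("need at least one hypothesis")
    numbered = "\n".join(f"{i}. {h}" for i, h in enumerate(hypotheses, start=1))
    return (
        "Below are the N-best hypotheses from an ASR system. "
        "Write the most likely true transcription.\n"
        f"{numbered}\n"
        "Transcription:"
    )


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Illustrative data only: a made-up 3-best list and reference.
n_best = ["turn of the light", "turn off the light", "turn off the lights"]
prompt = build_correction_prompt(n_best)
reference = "turn off the light"
wer_top1 = word_error_rate(reference, n_best[0])  # one substitution in four words
```

In this toy case the top hypothesis has one wrong word out of four, while the second hypothesis matches the reference exactly. A correction model that picks or generates the right transcription from the list would therefore lower the error rate.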

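LipGER is a novel framework that uses visual cues from lip motion to improve the performance of Automatic Speech Recognition (ASR) systems in noisy environments. It has an LLM learn the task of visually-conditioned ASR error correction, which addresses key challenges in traditional AVSR learning.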

LipGER: Visually-Conditioned Generative Error Correction for Robust Automatic Speech Recognition

Talking face generation aims to synthesize a sequence of face images that
correspond to a clip of speech. This is a challenging task because face
appearance variation and semantics of speech are coupled together in the subtle
movements of the talking face regions. Existing works either construct a
subject-specific face appearance model or model the transformation between
lip motion and speech. In this work, we integrate both aspects and enable
arbitrary-subject talking face generation by learning a disentangled audio-visual
representation. We find that a talking face sequence is a composition of
subject-related information and speech-related information.
These two spaces are then explicitly disentangled through a novel
associative-and-adversarial training process. The disentangled representation
has the advantage that both audio and video can serve as inputs for generation.
Extensive experiments show that the proposed approach generates realistic
talking face sequences on arbitrary subjects with much clearer lip motion
patterns than previous work. We also demonstrate that the learned audio-visual
representation is useful for the tasks of automatic lip reading and
audio-video retrieval.

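This work aims to achieve arbitrary-subject talking face generation by learning a disentangled audio-visual representation, and shows that the learned representation is very useful for automatic lip reading and audio-video retrieval.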