Since facial actions such as lip movements contain significant information
about speech content, it is not surprising that audio-visual speech enhancement
methods are more accurate than their audio-only counterparts. Yet,
state-of-the-art approaches still struggle to generate clean, realistic speech
without noise artifacts and unnatural distortions in challenging acoustic
environments. In this paper, we propose a novel audio-visual speech enhancement
framework for high-fidelity telecommunications in AR/VR. Our approach leverages
audio-visual speech cues to generate the codes of a neural speech codec,
enabling efficient synthesis of clean, realistic speech from noisy signals.
Given the importance of speaker-specific cues in speech, we focus on developing
personalized models that work well for individual speakers. We demonstrate the
efficacy of our approach on a new audio-visual speech dataset collected in an
unconstrained, large vocabulary setting, as well as existing audio-visual
datasets, outperforming speech enhancement baselines on both quantitative
metrics and human evaluation studies. Please see the supplemental video for
qualitative results at
this https URL

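This paper proposes a new audio-visual speech enhancement framework that uses personalized models and a neural speech codec to efficiently synthesize clean, realistic speech from noisy signals, with the aim of improving the quality of the enhanced speech.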

Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis

Humans involuntarily tend to infer parts of the conversation from lip
movements when the speech is absent or corrupted by external noise. In this
work, we explore the task of lip to speech synthesis, i.e., learning to
generate natural speech given only the lip movements of a speaker.
Acknowledging the importance of contextual and speaker-specific cues for
accurate lip-reading, we take a different path from existing works. We focus on
learning accurate lip sequences to speech mappings for individual speakers in
unconstrained, large vocabulary settings. To this end, we collect and release a
large-scale benchmark dataset, the first of its kind, specifically to train and
evaluate the single-speaker lip to speech task in natural settings. We propose
a novel approach with key design choices to achieve accurate, natural lip to
speech synthesis in such unconstrained scenarios for the first time. Extensive
evaluation using quantitative metrics, qualitative metrics, and human studies shows
that our method is four times more intelligible than previous works in this
space. Please check out our demo video for a quick overview of the paper,
method, and qualitative results.
this https URL&feature=youtu.be

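This paper proposes a method for synthesizing speech from a speaker's lip movements. By collecting a large-scale lip-movement dataset and designing the model for single-speaker lip reading in natural settings, the method reproduces the speaker's speech more accurately and naturally. Quantitative and qualitative evaluations show that it is four times more intelligible than existing methods.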