Audio and visual modalities are inherently connected in speech signals: lip
movements and facial expressions are correlated with speech sounds. This
motivates studies that incorporate the visual modality to enhance an acoustic
speech signal or even restore missing audio information. This paper focuses on
audio-visual speech inpainting, the task of synthesizing the speech in a
corrupted audio segment so that it is consistent with the corresponding visual
content and the uncorrupted audio context. We present an audio-visual
transformer-based deep learning model that leverages visual cues about the
content of the corrupted audio. It outperforms the previous state-of-the-art
audio-visual model and audio-only baselines. We also show that visual features
extracted with AV-HuBERT, a large audio-visual transformer for speech
recognition, are suitable for synthesizing speech.

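This paper presents a transformer-based deep learning model for audio-visual speech inpainting that uses visual cues to recover the content of the corrupted audio. Experiments show that it outperforms the previous state-of-the-art audio-visual model and audio-only baselines, and that visual features extracted with AV-HuBERT are suitable for synthesizing speech.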

Speech inpainting: Context-based speech synthesis guided by video
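
To make the task framing in the abstract above concrete, here is a minimal sketch. It is not the paper's model. It assumes a hypothetical one-value-per-frame signal and a single contiguous corrupted span, and it swaps in naive linear interpolation (an audio-only stand-in) where the paper uses a transformer conditioned on video. The names `make_gap_mask` and `inpaint_linear` are illustrative only. The sketch shows the contract that any inpainting method shares: frames outside the gap pass through unchanged, and only the masked frames are synthesized.

```python
from typing import List


def make_gap_mask(num_frames: int, gap_start: int, gap_end: int) -> List[bool]:
    """Per-frame mask: True where the audio is corrupted, i.e. [gap_start, gap_end)."""
    if not 0 <= gap_start < gap_end <= num_frames:
        raise ValueError("gap must lie inside the signal")
    return [gap_start <= t < gap_end for t in range(num_frames)]


def inpaint_linear(frames: List[float], mask: List[bool]) -> List[float]:
    """Naive stand-in inpainter: linearly interpolate across one contiguous gap.

    Assumption: the mask marks a single contiguous span. Frames outside the
    gap are copied unchanged so the uncorrupted context is preserved.
    """
    gap = [t for t, corrupted in enumerate(mask) if corrupted]
    if not gap:
        return list(frames)
    start, end = gap[0], gap[-1] + 1
    has_left, has_right = start > 0, end < len(frames)
    if not (has_left or has_right):
        raise ValueError("no uncorrupted context to inpaint from")
    # Nearest clean frame on each side; reuse the other side if one is missing.
    left = frames[start - 1] if has_left else frames[end]
    right = frames[end] if has_right else frames[start - 1]
    out = list(frames)
    span = end - start + 1
    for i, t in enumerate(range(start, end), start=1):
        out[t] = left + (right - left) * i / span
    return out


# Hypothetical signal: frames 2 and 3 are corrupted (values 9.0 are garbage).
corrupted = [0.0, 1.0, 9.0, 9.0, 4.0, 5.0]
mask = make_gap_mask(len(corrupted), gap_start=2, gap_end=4)
restored = inpaint_linear(corrupted, mask)
```

An audio-visual model would replace the interpolation step with a prediction conditioned on video features aligned to the masked frames, while keeping the same pass-through behaviour for the context.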

Audio-based automatic speech recognition (ASR) degrades significantly in
noisy environments and is particularly vulnerable to interfering speech, as the
model cannot determine which speaker to transcribe. Audio-visual speech
recognition (AVSR) systems improve robustness by complementing the audio stream
with the visual information that is invariant to noise and helps the model
focus on the desired speaker. However, previous AVSR work focused solely on the
supervised learning setup; hence the progress was hindered by the amount of
labeled data available. In this work, we present a self-supervised AVSR
framework built upon Audio-Visual HuBERT (AV-HuBERT), a state-of-the-art
audio-visual speech representation learning model. On LRS3, the largest
available AVSR benchmark dataset, our approach outperforms the prior state of
the art by ~50% in relative WER (28.0% vs. 14.1%) in the presence of babble
noise while using less than 10% of the labeled data (433 hr vs. 30 hr), and it
reduces the WER of an audio-based model by over 75% (25.8% vs. 5.8%) on average.

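This paper presents a self-supervised audio-visual speech recognition framework built on AV-HuBERT. Using less than 10% of the labeled data in LRS3 (30 hours vs. 433 hours), it outperforms the prior state of the art by about 50% (28.0% vs. 14.1%) under babble noise, and it reduces the word error rate of an audio-based model by over 75% on average.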

Robust Self-Supervised Audio-Visual Speech Recognition
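
The percentages quoted in the abstract above can be read as relative WER reductions. The short sketch below recomputes them from the quoted WER pairs. The function name `relative_wer_reduction` is illustrative, and treating the "~50%" and "over 75%" figures as relative reductions is an interpretation of the abstract's wording rather than something it spells out.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative reduction, in percent, of new_wer with respect to baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# WER pairs (in percent) as quoted in the abstract.
noisy = relative_wer_reduction(28.0, 14.1)      # prior state of the art vs. proposed, babble noise
audio_only = relative_wer_reduction(25.8, 5.8)  # audio-based model vs. audio-visual model

# Share of labeled data used: 30 hours out of 433 hours.
label_share = 100.0 * 30 / 433

print(f"{noisy:.1f}%")        # 49.6%
print(f"{audio_only:.1f}%")   # 77.5%
print(f"{label_share:.1f}%")  # 6.9%
```

The results line up with the abstract's claims of roughly 50%, over 75%, and less than 10% of the labeled data.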

Video recordings of speech contain correlated audio and visual information,
providing a strong signal for speech representation learning from the speaker's
lip movements and the produced sound. We introduce Audio-Visual Hidden Unit
BERT (AV-HuBERT), a self-supervised representation learning framework for
audio-visual speech, which masks multi-stream video input and predicts
automatically discovered and iteratively refined multimodal hidden units.
AV-HuBERT learns powerful audio-visual speech representations that benefit both
lip-reading and automatic speech recognition. On the largest public lip-reading
benchmark LRS3 (433 hours), AV-HuBERT achieves 32.5% WER with only 30 hours of
labeled data, outperforming the former state-of-the-art approach (33.6%)
trained with a thousand times more transcribed video data (31K hours). The
lip-reading WER is further reduced to 26.9% when all 433 hours of labeled data
from LRS3 are used together with self-training. Using our audio-visual
representation on the same benchmark for audio-only speech recognition leads to
a 40% relative WER reduction over the state-of-the-art performance (1.3% vs
2.3%). Our code and models are available at
this https URL

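AV-HuBERT is a self-supervised framework that learns audio-visual speech representations from video, benefiting both lip-reading and speech recognition. On the 433-hour public LRS3 benchmark, combining AV-HuBERT with self-training reduces the lip-reading WER to 26.9%, and using the same representation for audio-only speech recognition yields a 40% relative WER reduction (1.3% vs. 2.3%).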
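
The masked-prediction objective described in the AV-HuBERT abstract above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the released implementation. It assumes the hidden-unit IDs are already given per frame, applies one shared mask to both streams, uses `None` in place of a learned mask embedding, and scores only the masked frames. All function names are hypothetical.

```python
import math
import random
from typing import List, Sequence

MASK = None  # placeholder standing in for a learned mask embedding (assumption)


def sample_span_mask(num_frames: int, span_len: int, num_spans: int,
                     rng: random.Random) -> List[bool]:
    """Mark num_spans spans of span_len frames as masked (spans may overlap)."""
    if not 0 < span_len <= num_frames:
        raise ValueError("span_len must be in [1, num_frames]")
    mask = [False] * num_frames
    for _ in range(num_spans):
        start = rng.randrange(0, num_frames - span_len + 1)
        for t in range(start, start + span_len):
            mask[t] = True
    return mask


def apply_mask(stream: Sequence, mask: Sequence[bool]) -> list:
    """Replace the masked frames of one input stream with the MASK placeholder."""
    return [MASK if m else x for x, m in zip(stream, mask)]


def masked_cross_entropy(probs: Sequence[Sequence[float]],
                         targets: Sequence[int],
                         mask: Sequence[bool]) -> float:
    """Average negative log-likelihood of the target unit over masked frames only."""
    losses = [-math.log(p[k]) for p, k, m in zip(probs, targets, mask) if m]
    return sum(losses) / len(losses) if losses else 0.0


rng = random.Random(0)
units = [0, 0, 1, 1, 2, 2, 0, 1]  # hypothetical per-frame hidden-unit IDs
mask = sample_span_mask(len(units), span_len=2, num_spans=2, rng=rng)

# Toy audio and video streams, masked with the same frame mask.
audio_in = apply_mask([0.1 * t for t in range(len(units))], mask)
video_in = apply_mask([f"lip_{t}" for t in range(len(units))], mask)

# A model that knows nothing predicts a uniform distribution over the 3 units.
uniform = [[1 / 3] * 3 for _ in units]
loss = masked_cross_entropy(uniform, units, mask)
```

With uniform predictions the loss equals log 3, the value a trained model would need to beat on the masked frames. The iterative refinement mentioned in the abstract would correspond to re-deriving `units` from the model's own features between training rounds, which this sketch leaves out.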