Significant progress has been made in speaker-dependent Lip-to-Speech
synthesis, which aims to generate speech from silent videos of talking faces.
Current state-of-the-art approaches primarily employ non-autoregressive
sequence-to-sequence architectures to directly predict mel-spectrograms or
audio waveforms from lip representations. We hypothesize that direct
mel-spectrogram prediction hampers training and model efficiency because
speech content is entangled with ambient information and speaker
characteristics. To address this, we propose RobustL2S, a modularized
framework for Lip-to-Speech synthesis.
First, a non-autoregressive sequence-to-sequence model maps self-supervised
visual features to a representation of disentangled speech content. A vocoder
then converts the speech features into raw waveforms. Extensive evaluations
confirm the effectiveness of our setup, achieving state-of-the-art performance
on the unconstrained Lip2Wav dataset and the constrained GRID and TCD-TIMIT
datasets. Speech samples from RobustL2S can be found at
this https URL
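
The two-stage design described above can be illustrated with a toy sketch. This is not the RobustL2S implementation: the function names (`content_model`, `toy_vocoder`, `lip_to_speech`), the dimensions, and the random linear layers are all assumptions chosen only to show the data flow, in which visual features are mapped frame by frame to content features and a separate vocoder stage turns those into waveform samples.

```python
import math
import random

# Illustrative sizes only; none of these values come from the paper.
VISUAL_DIM = 8    # length of one self-supervised visual feature vector
CONTENT_DIM = 4   # length of one disentangled speech-content vector
HOP = 4           # waveform samples produced per content frame (toy value)


def make_weights(rows, cols, rng):
    """Return a small random rows x cols matrix as nested lists."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def project(frame, weight):
    """Apply one linear layer followed by tanh to a single feature vector."""
    return [
        math.tanh(sum(x * row[j] for x, row in zip(frame, weight)))
        for j in range(len(weight[0]))
    ]


def content_model(visual_feats, weight):
    """Stage 1 stand-in: map each video frame to a content vector.

    Every frame is processed independently of previously generated outputs,
    which is the property that makes the mapping non-autoregressive.
    """
    return [project(frame, weight) for frame in visual_feats]


def toy_vocoder(content_feats, weight):
    """Stage 2 stand-in: expand each content frame into HOP waveform samples."""
    waveform = []
    for frame in content_feats:
        waveform.extend(project(frame, weight))
    return waveform


def lip_to_speech(visual_feats, w_content, w_vocoder):
    """Run both stages and return the intermediate content and the waveform."""
    content = content_model(visual_feats, w_content)
    return content, toy_vocoder(content, w_vocoder)


rng = random.Random(0)
num_frames = 25
visual = [[rng.gauss(0.0, 1.0) for _ in range(VISUAL_DIM)]
          for _ in range(num_frames)]
w_content = make_weights(VISUAL_DIM, CONTENT_DIM, rng)
w_vocoder = make_weights(CONTENT_DIM, HOP, rng)

content, waveform = lip_to_speech(visual, w_content, w_vocoder)
print(len(content), len(content[0]), len(waveform))
```

Keeping the content representation as an explicit intermediate means each stage could be trained or replaced on its own, which is the modularity the abstract emphasizes.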

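RobustL2S is a modular Lip-to-Speech synthesis framework. It maps self-supervised lip representations to disentangled speech content features, then uses a vocoder to convert those features into raw waveforms, achieving state-of-the-art performance on multiple datasets.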

RobustL2S: Speaker-Specific Lip-to-Speech Synthesis exploiting Self-Supervised Representations

In this paper, we propose a novel Lip-to-Speech synthesis (L2S) framework
for synthesizing intelligible speech from a silent lip movement video.
Specifically, to complement the insufficient supervisory signal of previous
L2S models, we propose to use quantized self-supervised speech representations,
named speech units, as an additional prediction target for the L2S model.
The proposed L2S model is therefore trained to generate multiple targets:
the mel-spectrogram and the speech units. As the speech units are discrete
while the mel-spectrogram is continuous, the proposed multi-target L2S model
can be trained with strong content supervision, without using text-labeled data.
Moreover, to accurately convert the synthesized mel-spectrogram into a
waveform, we introduce a multi-input vocoder that can generate a clear waveform
even from a blurry and noisy mel-spectrogram by referring to the speech units.
Extensive experimental results confirm the effectiveness of the proposed method
in L2S.
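
A multi-target objective of this kind can be sketched as a regression term on the continuous mel-spectrogram plus a classification term on the discrete speech units. The abstract does not state the exact losses, so the choice of L1 and cross-entropy, the `unit_weight` parameter, and all function names below are assumptions made for illustration.

```python
import math


def l1_loss(pred, target):
    """Mean absolute error over a (frames x mel_bins) nested list."""
    total = sum(
        abs(p - t)
        for pred_row, target_row in zip(pred, target)
        for p, t in zip(pred_row, target_row)
    )
    count = sum(len(row) for row in target)
    return total / count


def cross_entropy(logits, unit_ids):
    """Mean negative log-likelihood of the target speech unit per frame."""
    total = 0.0
    for frame_logits, unit in zip(logits, unit_ids):
        peak = max(frame_logits)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(x - peak) for x in frame_logits))
        total += log_z - frame_logits[unit]
    return total / len(unit_ids)


def multi_target_loss(mel_pred, mel_target, unit_logits, unit_ids,
                      unit_weight=1.0):
    """Combine the continuous and discrete targets (weighting is hypothetical)."""
    return (l1_loss(mel_pred, mel_target)
            + unit_weight * cross_entropy(unit_logits, unit_ids))


# Two frames, two mel bins, and a vocabulary of three speech units.
mel_pred = [[0.0, 1.0], [2.0, 3.0]]
mel_target = [[0.5, 1.0], [2.0, 2.5]]
unit_logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # uniform predictions
unit_ids = [0, 2]

loss = multi_target_loss(mel_pred, mel_target, unit_logits, unit_ids)
print(loss)
```

In this toy case the mel term is 0.25 and the unit term is log 3, since uniform logits over three units give each target a probability of one third. Because each unit is a class label, the classification term penalizes wrong content directly, whereas the regression term alone can be satisfied by a blurry average.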

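This paper proposes a new Lip-to-Speech synthesis (L2S) framework that uses quantized self-supervised speech representations as an additional prediction target, enabling multi-target L2S training with strong content supervision. It also introduces a multi-input vocoder that accurately converts the synthesized mel-spectrogram into a waveform, and experiments confirm the method's effectiveness for L2S.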

Intelligible Lip-to-Speech Synthesis with Speech Units

Unconstrained lip-to-speech synthesis aims to generate the corresponding speech
from silent videos of talking faces with no restriction on head poses or
vocabulary. Current work mainly uses sequence-to-sequence models to solve this
problem, either in an autoregressive architecture or a flow-based
non-autoregressive architecture. However, these models suffer from several
drawbacks: 1) Instead of directly generating audio, they use a two-stage
pipeline that first generates mel-spectrograms and then reconstructs audio
from the spectrograms. This makes deployment cumbersome and degrades
speech quality through error propagation; 2) The audio reconstruction algorithm
used by these models limits the inference speed and audio quality, while neural
vocoders cannot be used with these models because their output spectrograms are
not accurate enough; 3) The autoregressive model suffers from high inference
latency, while the flow-based model has high memory occupancy: neither of them
is efficient in both time and memory usage. To tackle these problems, we
propose FastLTS, a non-autoregressive end-to-end model that directly
synthesizes high-quality speech audio from unconstrained talking videos with
low latency and has a relatively small model size. In addition, unlike the
widely used 3D-CNN visual frontend for lip movement encoding, we propose, for
the first time, a transformer-based visual frontend for this task. Experiments
show that our model achieves a $19.76\times$ speedup for audio waveform
generation compared with the current autoregressive model on input sequences of
3 seconds, and obtains superior audio quality.
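
The latency argument can be made concrete with a toy comparison of the two decoding styles. This is only a structural sketch, not FastLTS: the step functions, the assumed 25 frames per second, and the convention of counting a fully parallelizable pass as one sequential step are all illustrative assumptions.

```python
def autoregressive_decode(features, step_fn):
    """Generate outputs one at a time; each step consumes the previous output."""
    outputs = []
    previous = 0.0
    sequential_steps = 0
    for x in features:
        previous = step_fn(x, previous)  # cannot start before the prior step ends
        outputs.append(previous)
        sequential_steps += 1
    return outputs, sequential_steps


def non_autoregressive_decode(features, frame_fn):
    """Generate every output from its own input, so frames are independent.

    Plain Python still evaluates the comprehension serially. On parallel
    hardware all frames could be computed in one pass, which is why this is
    counted as a single sequential step.
    """
    return [frame_fn(x) for x in features], 1


# A 3-second clip at an assumed 25 frames per second.
features = [0.1 * i for i in range(75)]

ar_out, ar_steps = autoregressive_decode(
    features, lambda x, prev: 0.5 * prev + x)
nar_out, nar_steps = non_autoregressive_decode(
    features, lambda x: 2.0 * x)

print(ar_steps, nar_steps)
```

The number of dependent steps grows with sequence length in the autoregressive case and stays constant in the non-autoregressive case. That structural difference, rather than any specific timing, is what the sketch is meant to show.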

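FastLTS is a fast non-autoregressive model with a transformer-based visual frontend that synthesizes high-quality audio from talking-face videos with unconstrained head poses and vocabulary. On 3-second input sequences it achieves a 19.76× speedup over the current autoregressive model and obtains better audio quality.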