Unconstrained lip-to-speech synthesis aims to generate the corresponding speech
from silent videos of talking faces, with no restriction on head pose or
vocabulary. Existing work mainly addresses this problem with
sequence-to-sequence models, using either an autoregressive architecture or a
flow-based non-autoregressive architecture. However, these models suffer from several
drawbacks: 1) Instead of directly generating audio, they use a two-stage
pipeline that first generates mel-spectrograms and then reconstructs the
waveform from the spectrograms. This makes deployment cumbersome and degrades
speech quality through error propagation; 2) The audio reconstruction algorithm
used by these models limits inference speed and audio quality, and neural
vocoders cannot be used with these models because their output spectrograms are
not accurate enough; 3) The autoregressive model suffers from high inference
latency, while the flow-based model has a high memory footprint; neither
is efficient in both time and memory usage. To tackle these problems, we
propose FastLTS, a non-autoregressive end-to-end model that directly
synthesizes high-quality speech audio from unconstrained talking videos with
low latency and a relatively small model size. In addition, unlike the
widely used 3D-CNN visual frontend for lip movement encoding, we propose, for
the first time, a transformer-based visual frontend for this task. Experiments
show that our model achieves a $19.76\times$ speedup in audio waveform
generation over the current autoregressive model on input sequences of
3 seconds, and obtains superior audio quality.
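
To make the reported speedup figure concrete, the following minimal sketch shows how a speedup ratio and a real-time factor are conventionally computed from measured latencies. The function names and the latency values are hypothetical placeholders chosen for illustration; they are not measurements or code from this work.

```python
def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """Return how many times faster the candidate is than the baseline."""
    if baseline_seconds <= 0 or candidate_seconds <= 0:
        raise ValueError("latencies must be positive")
    return baseline_seconds / candidate_seconds


def real_time_factor(inference_seconds: float, audio_seconds: float) -> float:
    """Return inference time per second of generated audio (below 1 is faster than real time)."""
    if inference_seconds < 0 or audio_seconds <= 0:
        raise ValueError("inference time must be non-negative and audio duration positive")
    return inference_seconds / audio_seconds


if __name__ == "__main__":
    # Placeholder latencies (assumptions, NOT results from the paper).
    autoregressive_latency = 2.0   # seconds to synthesize one clip
    candidate_latency = 0.1        # seconds to synthesize the same clip
    clip_duration = 3.0            # seconds of audio, matching a 3-second input

    print(f"speedup: {speedup(autoregressive_latency, candidate_latency):.2f}x")
    print(f"real-time factor: {real_time_factor(candidate_latency, clip_duration):.4f}")
```

Because a speedup is a ratio of latencies measured on the same input length and hardware, a figure such as $19.76\times$ is only meaningful together with the stated 3-second input condition.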

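Summary: FastLTS is a fast non-autoregressive model with a transformer-based visual frontend that synthesizes high-quality audio from talking-face videos with arbitrary head poses and vocabulary. It achieves a 19.76× speedup over the current autoregressive model on 3-second input sequences and obtains better audio quality.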