Video-to-speech synthesis involves reconstructing the speech signal of a
speaker from a silent video. The implicit assumption of this task is that the
sound signal is either missing or so heavily corrupted by noise that it is not
useful for processing. Previous works in the literature
either use video inputs only or employ both video and audio inputs during
training, and discard the input audio pathway during inference. In this work we
investigate the effect of using video and audio inputs for video-to-speech
synthesis during both training and inference. In particular, we use pre-trained
video-to-speech models to synthesize the missing speech signals and then train
an audio-visual-to-speech synthesis model, using both the silent video and the
synthesized speech as inputs, to predict the final reconstructed speech. Our
experiments demonstrate that this approach is successful with both raw
waveforms and mel spectrograms as target outputs.
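
The two-stage inference flow described above can be sketched as follows. This is only an illustration of the data flow, not the authors' implementation: the names `Video`, `Audio`, `v2s_model`, `av2s_model`, and `audio_visual_to_speech` are hypothetical, and both models are treated as opaque callables.

```python
from typing import Callable, Sequence

# Hypothetical placeholder types: one feature vector per video frame, and a
# flat sequence of audio values (waveform samples or flattened mel frames).
Video = Sequence[Sequence[float]]
Audio = Sequence[float]


def audio_visual_to_speech(
    video: Video,
    v2s_model: Callable[[Video], Audio],
    av2s_model: Callable[[Video, Audio], Audio],
) -> Audio:
    """Run the two-stage pipeline on a silent video.

    v2s_model  : assumed pre-trained video-to-speech model (video only).
    av2s_model : assumed audio-visual model that takes the silent video
                 together with the synthesized speech.
    """
    # Stage 1: synthesize the missing speech signal from the silent video.
    synthesized_speech = v2s_model(video)
    # Stage 2: predict the final speech from the video plus the synthesized audio.
    return av2s_model(video, synthesized_speech)


if __name__ == "__main__":
    # Stub models, for demonstration only.
    def stub_v2s(video: Video) -> Audio:
        return [sum(frame) for frame in video]

    def stub_av2s(video: Video, audio: Audio) -> Audio:
        return [0.5 * sample for sample in audio]

    print(audio_visual_to_speech([[1.0, 2.0], [3.0, 4.0]], stub_v2s, stub_av2s))
```

Because the audio pathway is fed with synthesized rather than recorded speech, the same call works at both training and inference time, which is the setting the abstract investigates.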

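This study investigates video-to-speech synthesis that uses both video and audio inputs. Pre-trained video-to-speech models synthesize the missing speech signal, and an audio-visual-to-speech synthesis model is then trained on both the silent video and the synthesized speech to predict the final reconstructed speech. The experiments show that this approach is successful with both raw waveforms and mel spectrograms as target outputs.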

Audio-visual video-to-speech synthesis with synthesized input audio

This paper presents a novel approach for generating 3D talking heads from raw
audio inputs. Our method builds on the idea that speech-related movements can
be comprehensively and efficiently described by the motion of a few control
points located on the movable parts of the face, i.e., landmarks. The
underlying musculoskeletal structure then allows us to learn how their motion
influences the geometrical deformations of the whole face. The proposed method
employs two distinct models to this end: the first learns to generate the
motion of a sparse set of landmarks from the given audio. The second model
expands this landmark motion into a dense motion field, which is used to
animate a given 3D mesh in its neutral state. Additionally, we introduce a novel
loss function, named Cosine Loss, which minimizes the angle between the
generated motion vectors and the ground truth ones. Using landmarks in 3D
talking head generation offers various advantages such as consistency,
reliability, and no need for manual annotation. Our approach is
designed to be identity-agnostic, enabling high-quality facial animations for
any user without additional data or training.
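
As an illustration of a loss that minimizes the angle between generated and ground-truth motion vectors, here is a minimal dependency-free sketch. The abstract does not give the exact formula, so the form used here (the mean of 1 minus the cosine similarity over all vectors) is an assumption, as are the function name `cosine_loss` and the `eps` parameter.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_loss(pred: Sequence[Vector], target: Sequence[Vector],
                eps: float = 1e-8) -> float:
    """Mean of (1 - cosine similarity) over paired motion vectors.

    Assumed formulation, not taken from the paper: the value is 0 when every
    predicted vector points in the same direction as its target, 1 when they
    are orthogonal, and 2 when they point in opposite directions.
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equally long")
    total = 0.0
    for p, t in zip(pred, target):
        dot = sum(a * b for a, b in zip(p, t))
        norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in t))
        # eps guards against division by zero for zero-length vectors.
        total += 1.0 - dot / max(norm, eps)
    return total / len(pred)


if __name__ == "__main__":
    predicted = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
    ground_truth = [(2.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    print(cosine_loss(predicted, ground_truth))
```

A loss of this form depends only on direction and ignores vector magnitude, so in practice it would likely be combined with a magnitude-sensitive term such as an L2 reconstruction loss; whether the paper does so is not stated in the abstract.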

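This study proposes a new method for generating 3D talking-head animation from audio input. It describes speech-related motion using control points (landmarks) on the movable parts of the face and relies on two distinct models to do so. The method is identity-agnostic and enables high-quality facial animation for any user. Using landmarks in 3D talking head generation offers several advantages, such as consistency, reliability, and no need for manual annotation.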