Recent research has demonstrated impressive results in video-to-speech
synthesis, which involves reconstructing speech solely from visual input.
However, previous works have struggled to synthesize speech accurately because
the model lacks sufficient guidance to infer the correct content with the
appropriate sound. To resolve this issue, they have adopted an extra speaker
embedding, extracted from reference audio, as speaking-style guidance.
Nevertheless, it is not always possible to obtain audio information for the
corresponding video input, especially at inference time. In this
paper, we present a novel vision-guided speaker embedding extractor that uses a
self-supervised pre-trained model and a prompt tuning technique. As a result,
rich speaker embedding information can be produced solely from the input visual
information, and no extra audio information is needed at inference time.
Using the extracted vision-guided speaker embedding
representations, we further develop a diffusion-based video-to-speech synthesis
model, called DiffV2S, conditioned on those speaker embeddings and the
visual representation extracted from the input video. The proposed DiffV2S not
only maintains the phoneme details contained in the input video frames, but also
creates a highly intelligible mel-spectrogram in which the identities of
multiple speakers are all preserved. Our experimental results show that
DiffV2S achieves state-of-the-art performance compared to previous
video-to-speech synthesis techniques.
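
As a rough illustration of what "diffusion-based" and "conditioned on" can mean
in practice, the following is a minimal sketch and not the DiffV2S
implementation. The abstract does not state the diffusion formulation, the
noise schedule, or the conditioning mechanism, so this sketch assumes a
standard DDPM-style noise-prediction setup with a linear schedule and simple
concatenation of the two conditioning signals. Every function name, parameter,
and value below is hypothetical.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances, one per step (assumed schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t: running product of (1 - beta_s) for all s up to t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(mel, alpha_bar_t, rng):
    """Forward process: x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in mel]
    signal_scale = math.sqrt(alpha_bar_t)
    noise_scale = math.sqrt(1.0 - alpha_bar_t)
    x_t = [signal_scale * x + noise_scale * e for x, e in zip(mel, eps)]
    return x_t, eps


def build_condition(visual_features, speaker_embedding):
    """Hypothetical conditioning: concatenate the two vectors."""
    return list(visual_features) + list(speaker_embedding)


def denoising_loss(predicted_eps, true_eps):
    """Mean squared error between predicted and true noise."""
    squared = [(p - t) ** 2 for p, t in zip(predicted_eps, true_eps)]
    return sum(squared) / len(true_eps)


# Toy usage with made-up numbers standing in for real features.
rng = random.Random(0)
betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)
mel = [0.5, -0.2, 0.1, 0.9]              # stand-in for mel-spectrogram values
x_t, eps = add_noise(mel, alpha_bars[499], rng)
cond = build_condition([0.1, 0.2, 0.3], [1.0, -1.0])
# A trained network would predict eps from (x_t, step, cond); here a
# placeholder prediction of all zeros is scored against the true noise.
loss = denoising_loss([0.0] * len(eps), eps)
```

In a setup like this, the network is trained to recover the injected noise
given the noisy mel-spectrogram, the step index, and the conditioning vector;
at inference time only the video-derived conditioning would be required.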
