Though significant progress has been made in speaker-dependent Video-to-Speech (VTS) synthesis, little attention has been devoted to multi-speaker VTS that can map silent video to speech while allowing flexible control of speaker identity, all in a single system. This paper proposes a novel multi-speaker VTS system based on cross-modal knowledge transfer from voice conversion (VC). Vector quantization with contrastive predictive coding (VQCPC) is used in the content encoder of VC to derive discrete, phoneme-like acoustic units, which are transferred to a Lip-to-Index (Lip2Ind) network that infers the index sequence of acoustic units. The Lip2Ind network can then substitute for the content encoder of VC to form a multi-speaker VTS system that converts silent video to acoustic units for reconstructing accurate spoken content. The VTS system also inherits the advantages of VC by using a speaker encoder to produce speaker representations that effectively control the speaker identity of the generated speech. Extensive evaluations verify the effectiveness of the proposed approach, which can be applied in both constrained-vocabulary and open-vocabulary conditions, achieving state-of-the-art performance in generating high-quality speech with high naturalness, intelligibility, and speaker similarity. Our demo page is released here: https://wendison.github.io/VCVTS-demo/
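To make the idea of discrete acoustic units and their index sequence concrete, the following is a minimal sketch of generic nearest-neighbor vector quantization. It is not the paper's VQCPC implementation: the function names (`quantize`, `lookup`), the two-dimensional toy codebook, and the frame values are all hypothetical, chosen only for illustration. Each frame-level feature is replaced by the index of its closest codebook vector; in a system like the one described, a network such as Lip2Ind would be trained to predict this kind of index sequence from video instead of computing it from audio.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Return, for each frame, the index of the nearest codebook vector."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(frame, codebook[k]))
        for frame in frames
    ]


def lookup(indices: Sequence[int], codebook: Sequence[Vector]) -> List[List[float]]:
    """Map an index sequence back to its codebook vectors (the discrete units)."""
    return [list(codebook[i]) for i in indices]


# Toy values (assumptions for illustration only, not from the paper).
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
frames = [[0.1, -0.2], [0.9, 1.2], [3.5, 0.3], [1.1, 0.8]]

indices = quantize(frames, codebook)  # index sequence of acoustic units
units = lookup(indices, codebook)     # unit vectors a decoder would consume

print(indices)
print(units)
```

Because the content is reduced to an index sequence, any front end that predicts the same indices can in principle stand in for the audio content encoder, which is the substitution the abstract describes.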

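This paper proposes a multi-speaker video-to-speech synthesis system based on cross-modal knowledge transfer. It uses vector quantization with contrastive predictive coding to derive discrete, phoneme-like acoustic units, a Lip-to-Index network to infer the index sequence of those acoustic units, and a speaker encoder to produce speaker representations that effectively control the speaker identity of the generated speech. Extensive evaluations show that the method achieves state-of-the-art performance in generating high-quality speech with high naturalness, intelligibility, and speaker similarity.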

VCVTS: Multi-speaker Video-to-Speech Synthesis via Cross-modal Knowledge Transfer from Voice Conversion