This paper investigates a novel task: generating talking face video solely
from speech. Speech-to-video generation can enable interesting
applications in entertainment, customer service, and human-computer-interaction
industries. Indeed, the timbre, accent, and speed of speech can carry rich
information relevant to a speaker's appearance. The challenge mainly lies in
disentangling the distinct visual attributes from audio signals. In this
paper, we propose a lightweight, cross-modal distillation method to extract
disentangled emotional and identity information from unlabelled video inputs.
The extracted features are then integrated by a generative adversarial network
into talking face video clips. With carefully crafted discriminators, the
proposed framework achieves realistic generation results. Experiments with
observed individuals demonstrate that the proposed framework captures
emotional expressions solely from speech and produces spontaneous facial
motion in the video output. The results of the proposed framework are almost
indistinguishable from those of a baseline method in which speech is
combined with a static image of the speaker. User studies also show that the proposed
method outperforms the existing algorithms in terms of emotion expression in
the generated videos.

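This research paper introduces a new method for generating talking face video from speech alone and proposes a lightweight cross-modal distillation method that extracts emotional and identity information from unlabelled video inputs. A generative adversarial network then integrates the extracted features into talking face video clips. Experimental results show that the proposed framework captures emotional expression from speech, generates videos with spontaneous facial motion, and outperforms existing algorithms in terms of emotion expression.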