We present a method to edit target portrait footage by taking a sequence of
audio as input to synthesize a photo-realistic video. This method is unique
because it is highly dynamic: it does not assume a person-specific rendering
network, yet it is capable of translating arbitrary source audio into arbitrary
video output. Instead of learning a highly heterogeneous and nonlinear mapping from
audio to the video directly, we first factorize each target video frame into
orthogonal parameter spaces, i.e., expression, geometry, and pose, via
monocular 3D face reconstruction. Next, a recurrent network is introduced to
translate source audio into expression parameters that are primarily related to
the audio content. The audio-translated expression parameters are then used to
synthesize a photo-realistic human subject in each video frame, with the
movement of the mouth regions precisely mapped to the source audio. The
geometry and pose parameters of the target human portrait are retained,
thereby preserving the context of the original video footage. Finally, we
introduce a novel video rendering network and a dynamic programming method to
construct a temporally coherent and photo-realistic video. Extensive
experiments demonstrate the superiority of our method over existing approaches.
Our method is end-to-end learnable and robust to voice variations in the source
audio.
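
The parameter swap described above can be illustrated with a minimal sketch. It assumes each frame has already been factorized into expression, geometry, and pose vectors; the `FaceParams` container and the `retarget_expression` helper are hypothetical names introduced for illustration and are not taken from the paper. The sketch shows only the recombination step, not the reconstruction, audio-to-expression, or rendering networks.

```python
from dataclasses import dataclass, replace
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class FaceParams:
    """Per-frame face parameters (hypothetical container, not the paper's API)."""
    expression: Tuple[float, ...]
    geometry: Tuple[float, ...]
    pose: Tuple[float, ...]


def retarget_expression(
    target_frames: Sequence[FaceParams],
    audio_expressions: Sequence[Sequence[float]],
) -> List[FaceParams]:
    """Swap in audio-derived expression; keep the target's geometry and pose."""
    if len(target_frames) != len(audio_expressions):
        raise ValueError("need one expression vector per target frame")
    return [
        replace(frame, expression=tuple(expr))
        for frame, expr in zip(target_frames, audio_expressions)
    ]
```

Because only the expression field is replaced, the geometry and pose of every target frame pass through unchanged, which is the property the abstract relies on to preserve the context of the original footage.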

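This work proposes a method for editing target portrait footage from audio input. It factorizes each target video frame into three orthogonal parameter spaces (expression, geometry, and pose), uses a recurrent network to translate the source audio into expression parameters, synthesizes a photo-realistic human subject while preserving the context of the original video, and finally applies dynamic programming to construct a temporally coherent, convincing photo-realistic video.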

Everybody's Talkin': Let Me Talk as You Want

Synthesizing human movements such as dancing is a flourishing research
field with several applications in computer graphics. Recent studies have
demonstrated the advantages of deep neural networks (DNNs) for achieving
remarkable performance in motion and music tasks with little effort for feature
pre-processing. However, applying DNNs to generate dance for a piece of music
remains challenging, because 1) DNNs need to generate long
sequences while mapping the music input, 2) DNNs need to constrain the
motion beat to the music, and 3) DNNs require a considerable amount of
hand-crafted data. In this study, we propose a weakly supervised deep recurrent
method for real-time basic dance generation with audio power spectrum as input.
The proposed model employs convolutional layers and a multilayered Long
Short-Term Memory (LSTM) network to process the audio input. Then, another deep
LSTM layer decodes the target dance sequence. Notably, this end-to-end approach
1) uses an auto-conditioned decoding configuration that reduces the accumulation
of feedback error in long dance sequences, 2) uses a contrastive cost function to
regulate the mapping between the music and motion beat, and 3) trains with weak
labels generated from the motion beat, reducing the amount of hand-crafted
data. We evaluate the proposed network based on i) the similarity between the
generated motion and the baseline dancer's motion, using a cross-entropy measure
for long dance sequences, and ii) the timing accuracy between the music and
motion beat, using an F-measure. Experimental results show that, after training on a small
dataset, the model generates basic dance steps with low cross entropy and
maintains an F-measure score similar to that of a baseline dancer.
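
The beat-timing evaluation can be made concrete with a small sketch. The abstract states only that an F-measure is used; the one-to-one matching rule and the `tolerance` window below are assumptions chosen for illustration, and `beat_f_measure` is a hypothetical helper rather than the authors' evaluation code.

```python
from typing import Sequence


def beat_f_measure(
    music_beats: Sequence[float],
    motion_beats: Sequence[float],
    tolerance: float = 0.1,
) -> float:
    """F-measure between two lists of beat times given in seconds.

    A motion beat counts as a hit if it lies within `tolerance` seconds of a
    music beat that has not been matched yet (assumed matching rule).
    """
    music = sorted(music_beats)
    motion = sorted(motion_beats)
    i = j = hits = 0
    while i < len(music) and j < len(motion):
        diff = motion[j] - music[i]
        if abs(diff) <= tolerance:
            hits += 1
            i += 1
            j += 1
        elif diff < 0:
            j += 1  # motion beat is too early to match any remaining music beat
        else:
            i += 1  # music beat has no motion beat close enough
    if hits == 0:
        return 0.0
    precision = hits / len(motion)
    recall = hits / len(music)
    return 2 * precision * recall / (precision + recall)
```

For example, with music beats at 0.5, 1.0, 1.5, and 2.0 s and motion beats at 0.52, 1.3, and 1.49 s, two motion beats are hits, giving a precision of 2/3, a recall of 1/2, and an F-measure of 4/7.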

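This study proposes a basic dance generation model built on a weakly supervised deep recurrent method that takes the audio power spectrum as input. It processes the audio with convolutional layers and a multilayered LSTM, uses a contrastive cost function to regulate the mapping between the music and motion beat, and trains with weak labels generated from the motion beat. Experimental results show that the model can generate basic dance steps after training on a small dataset while maintaining an F-measure similar to that of a baseline dancer.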

Weakly Supervised Deep Recurrent Neural Networks for Basic Dance Step Generation

We present a novel approach to generating photo-realistic images of a face
with accurate lip sync, given an audio input. Using a recurrent neural
network, we obtain mouth landmarks from audio features. We then exploit the
power of conditional generative adversarial networks to produce
highly realistic faces conditioned on a set of landmarks. These two networks
together are capable of producing a sequence of natural faces in sync with an
input audio track.
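
The two-stage data flow can be sketched as follows. This is only an outline of how the stages connect: `landmark_step` stands in for the recurrent landmark predictor and `generator` for the conditional GAN generator, both passed in as plain callables. All names and type aliases here are hypothetical and do not reflect the authors' implementation.

```python
from typing import Any, Callable, List, Sequence, Tuple

Features = Sequence[float]   # audio features for one frame
Landmarks = List[float]      # flattened mouth landmark coordinates
Image = List[List[float]]    # stand-in for a generated face image


def synthesize_faces(
    audio_features: Sequence[Features],
    landmark_step: Callable[[Features, Any], Tuple[Landmarks, Any]],
    generator: Callable[[Landmarks], Image],
    initial_state: Any,
) -> List[Image]:
    """Run audio -> landmarks -> face image, one frame per feature vector."""
    state = initial_state
    frames: List[Image] = []
    for feats in audio_features:
        # Recurrent stage: the state carries context from earlier audio frames.
        landmarks, state = landmark_step(feats, state)
        # Conditional stage: the image depends only on the predicted landmarks.
        frames.append(generator(landmarks))
    return frames
```

Keeping the landmarks as the only interface between the two stages means each network can be trained or replaced independently, at the cost of discarding any audio information that the landmarks do not capture.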

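Using a recurrent neural network and a conditional generative adversarial network, the method generates photo-realistic face images with accurate lip sync from an audio input.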