Creating realistic human videos requires generating appearance and motion simultaneously. To tackle this challenge, we propose $G^3AN$, a novel spatio-temporal GAN architecture that seeks to capture the distribution of high-dimensional video data and to model appearance and motion in a disentangled manner. The latter is achieved by decomposing appearance and motion in a three-stream generator, where the main stream aims to model spatio-temporal consistency and the two auxiliary streams augment the main stream with multi-scale appearance and motion features, respectively. Extensive quantitative and qualitative analysis shows that our model systematically and significantly outperforms state-of-the-art methods on the facial expression datasets MUG and UvA-NEMO, as well as on the human action datasets Weizmann and UCF101. Additional analysis of the learned latent representations confirms the successful decomposition of appearance and motion.
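To make the three-stream idea concrete, the following is a toy sketch in plain Python, not the architecture itself. It assumes two separate latent codes (here called `z_a` for appearance and `z_m` for motion), hand-written functions in place of learned layers, a single scale instead of multi-scale features, and fusion by broadcast addition. All of these names and choices are illustrative assumptions; the abstract does not specify how the streams are fused.

```python
import math
import random


def appearance_stream(z_a, height, width):
    """Toy auxiliary stream: one H x W map per video, with no time axis."""
    return [[math.tanh(z_a[(h * width + w) % len(z_a)]) for w in range(width)]
            for h in range(height)]


def motion_stream(z_m, frames):
    """Toy auxiliary stream: one value per frame, with no spatial axes."""
    return [math.tanh(z_m[t % len(z_m)]) for t in range(frames)]


def main_stream(z_a, z_m, frames, height, width):
    """Toy main stream: a T x H x W volume that depends on both codes."""
    scale = sum(z_a) / len(z_a)
    drift = sum(z_m) / len(z_m)
    return [[[math.tanh(scale + drift * t) for _ in range(width)]
             for _ in range(height)] for t in range(frames)]


def generate(z_a, z_m, frames=4, height=3, width=3):
    """Fuse the streams by broadcasting the auxiliary features (assumed fusion rule)."""
    app = appearance_stream(z_a, height, width)          # shape (H, W)
    mot = motion_stream(z_m, frames)                     # shape (T,)
    vid = main_stream(z_a, z_m, frames, height, width)   # shape (T, H, W)
    return [[[vid[t][h][w] + app[h][w] + mot[t] for w in range(width)]
             for h in range(height)] for t in range(frames)]


def sample_code(seed, size=8):
    """Draw a latent code from a standard normal with a fixed seed."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


# Same appearance code, two different motion codes.
z_a = sample_code(0)
video_1 = generate(z_a, sample_code(1))
video_2 = generate(z_a, sample_code(2))
```

In this sketch the appearance features depend only on `z_a` and the motion features only on `z_m`, so holding one code fixed while resampling the other changes only the corresponding factor. That is the kind of behavior a disentangled latent space is meant to exhibit; in the real model it would have to emerge from training rather than from the construction.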

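This paper proposes $G^3AN$, a novel spatio-temporal generative model that captures the distribution of high-dimensional video data and models appearance and motion in a disentangled manner. It significantly outperforms existing methods on the facial expression datasets MUG and UvA-NEMO and on the human action datasets Weizmann and UCF101, and it analyzes the successful decomposition of the learned latent representations.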

G3AN: Disentangling Appearance and Motion in Video Generation