Leveraging physical knowledge described by partial differential equations
(PDEs) is an appealing way to improve unsupervised video prediction methods.
Since physics is too restrictive for describing the full visual content of
generic videos, we introduce PhyDNet, a two-branch deep architecture, which
explicitly disentangles PDE dynamics from unknown complementary information. A
second contribution is a new recurrent physical cell (PhyCell),
inspired by data assimilation techniques, which performs PDE-constrained
prediction in latent space. Extensive experiments on four diverse
datasets show that PhyDNet outperforms state-of-the-art methods.
Ablation studies also highlight the substantial gains brought by both
disentanglement and PDE-constrained prediction. Finally, we show that PhyDNet
presents interesting features for dealing with missing data and long-term
forecasting.

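This paper introduces a two-branch deep architecture (PhyDNet) and a new recurrent physical cell (PhyCell) that use physical knowledge described by PDEs to improve unsupervised video prediction. Extensive experiments on four different datasets show that PhyDNet outperforms existing methods.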

Disentangling Physical Dynamics from Unknown Factors for Unsupervised Video Prediction
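
The abstract above describes PDE-constrained prediction inspired by data assimilation, and notes that this helps with missing data and long-term forecasting. The following is a minimal sketch of that predict-then-correct idea, not the authors' implementation. Everything in it is an assumption made for illustration: the names `physical_step`, `correct`, and `rollout`, the parameters `diffusivity` and `gain`, and the use of a fixed 1D heat equation in place of the learned latent-space dynamics. The second (residual) branch of the architecture is omitted.

```python
def physical_step(h, diffusivity=0.1):
    """One explicit finite-difference step of a 1D heat equation.

    Hypothetical stand-in for a learned PDE operator; periodic boundary.
    """
    n = len(h)
    return [
        h[i] + diffusivity * (h[(i - 1) % n] - 2.0 * h[i] + h[(i + 1) % n])
        for i in range(n)
    ]


def correct(h_pred, observation, gain=0.5):
    """Pull the prediction toward an observation (assimilation-style update).

    When the observation is missing (None), the prediction is kept unchanged,
    so the physical model alone carries the state forward.
    """
    if observation is None:
        return h_pred
    return [p + gain * (o - p) for p, o in zip(h_pred, observation)]


def rollout(h0, observations, diffusivity=0.1, gain=0.5):
    """Alternate prediction and correction over a sequence of observations."""
    h = list(h0)
    states = []
    for obs in observations:
        h = correct(physical_step(h, diffusivity), obs, gain)
        states.append(h)
    return states


if __name__ == "__main__":
    initial = [0.0, 0.0, 1.0, 0.0]
    # The second observation is missing; the third is available again.
    observed = [[0.0, 0.1, 0.8, 0.1], None, [0.05, 0.2, 0.6, 0.15]]
    for step, state in enumerate(rollout(initial, observed), start=1):
        print(step, [round(v, 4) for v in state])
```

Passing `None` for every observation turns the rollout into a pure forecast driven only by the physical model, which is one way to picture the long-term forecasting setting mentioned in the abstract.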

Video captioning, the task of describing the content of a video, has seen
some promising improvements in recent years with sequence-to-sequence models,
but accurately learning the temporal and logical dynamics involved in the task
remains a challenge, especially given the lack of sufficient annotated
data. We improve video captioning by sharing knowledge with two related
directed-generation tasks: a temporally-directed unsupervised video prediction
task to learn richer context-aware video encoder representations, and a
logically-directed language entailment generation task to learn better
video-entailed caption decoder representations. For this, we present a
many-to-many multi-task learning model that shares parameters across the
encoders and decoders of the three tasks. We achieve significant improvements
and a new state of the art on several standard video captioning datasets
using diverse automatic and human evaluations. We also show mutual multi-task
improvements on the entailment generation task.

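A multi-task learning model combines video captioning with unsupervised video prediction and language entailment generation, sharing parameters to learn richer video encoder representations and better video-entailed caption decoder representations. It significantly improves video captioning performance and reaches a new state of the art on several standard datasets.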