A core challenge for an agent learning to interact with the world is to predict how its actions affect objects in its environment. Many existing methods for learning the dynamics of physical interactions require labeled object information. However, to scale real-world interaction learning to a variety of scenes and objects, acquiring labeled data becomes increasingly impractical. To learn about physical object motion without labels, we develop an action-conditioned video prediction model that explicitly models pixel motion, by predicting a distribution over pixel motion from previous frames. Because our model explicitly predicts motion, it is partially invariant to object appearance, enabling it to generalize to previously unseen objects. To explore video prediction for real-world interactive agents, we also introduce a dataset of 59,000 robot interactions involving pushing motions, including a test set with novel objects. In this dataset, accurate prediction of videos conditioned on the robot's future actions amounts to learning a "visual imagination" of different futures based on different courses of action. Our experiments show that our proposed method not only produces more accurate video predictions, but also more accurately predicts object motion, when compared to prior methods.

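We develop an action-conditioned video prediction model that explicitly models pixel motion and thereby learns about physical object motion. The model is also partially invariant to object appearance, so it generalizes to previously unseen objects. We introduce a dataset of 59,000 robot interactions involving pushing motions, including a test set with novel objects. Experiments show that, compared with existing methods, our method predicts video more accurately, both quantitatively and qualitatively.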

Unsupervised Learning for Physical Interaction through Video Prediction