A key challenge in manipulation is learning a policy that can robustly generalize to diverse visual environments. A promising mechanism for learning robust policies is to leverage video generative models, which are pretrained on large-scale datasets of internet videos. In this paper, we propose a visuomotor policy learning framework that fine-tunes a video diffusion model on human demonstrations of a given task. At test time, we generate an example of an execution of the task conditioned on images of a novel scene, and use this synthesized execution directly to control the robot. Our key insight is that using common tools allows us to effortlessly bridge the embodiment gap between the human hand and the robot manipulator. We evaluate our approach on four tasks of increasing complexity and demonstrate that harnessing internet-scale generative models allows the learned policy to achieve a significantly higher degree of generalization than existing behavior cloning approaches.

Dreamitate: Real-World Visuomotor Policy Learning via Video Generation