How can robot manipulation policies generalize to novel tasks involving unseen object types and new motions? In this paper, we address this question by predicting motion information from web data through human video generation and conditioning a robot policy on the generated video. Instead of attempting to scale robot data collection, which is expensive, we show how video generation models trained on easily available web data can be leveraged to enable generalization. Our approach, Gen2Act, casts language-conditioned manipulation as zero-shot human video generation followed by execution with a single policy conditioned on the generated video. To train the policy, we use an order of magnitude less robot interaction data than the video prediction model was trained on. Gen2Act does not require fine-tuning the video model at all; we directly use a pre-trained model to generate human videos. Our results on diverse real-world scenarios show how Gen2Act enables manipulating unseen object types and performing novel motions for tasks not present in the robot data. Videos are at https://homangab.github.io/gen2act/
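
The two-stage structure described above (zero-shot human video generation, then execution by a single video-conditioned policy) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Gen2ActStylePipeline`, `generate_human_video`, `policy`, and `run_episode` are hypothetical, and the choices to condition the generator on the first camera frame and to generate the video once per episode are assumptions made only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List

# Placeholder types: a real system would use image tensors and robot commands.
Frame = List[List[float]]
Action = List[float]


@dataclass
class Gen2ActStylePipeline:
    # Hypothetical frozen, pre-trained video model: (first frame, instruction) -> human video.
    generate_human_video: Callable[[Frame, str], List[Frame]]
    # Hypothetical single policy: (generated video, current observation) -> action.
    policy: Callable[[List[Frame], Frame], Action]

    def plan(self, first_frame: Frame, instruction: str) -> List[Frame]:
        # Stage 1: zero-shot generation; the video model is never fine-tuned here.
        return self.generate_human_video(first_frame, instruction)

    def act(self, video: List[Frame], observation: Frame) -> Action:
        # Stage 2: the policy is conditioned on the generated video.
        return self.policy(video, observation)


def run_episode(
    pipeline: Gen2ActStylePipeline,
    get_observation: Callable[[], Frame],
    instruction: str,
    num_steps: int,
) -> List[Action]:
    # Assumption: the video is generated once, from the initial observation.
    video = pipeline.plan(get_observation(), instruction)
    actions = []
    for _ in range(num_steps):
        actions.append(pipeline.act(video, get_observation()))
    return actions
```

In this sketch the language instruction reaches the policy only through the generated video, which mirrors the abstract's framing of language-conditioned manipulation as video generation followed by video-conditioned execution.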

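This work addresses the generalization of robot manipulation policies to tasks involving unseen object types and new motions. By predicting motion information from web data and using human video generation to guide robot manipulation, it shows how generative models trained on easily available web data can enable a robot to complete tasks it has not previously encountered. Our experimental results show that the method significantly improves the robot's manipulation capability across a variety of real-world scenarios.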

Gen2Act: Human Video Generation in Novel Scenarios Enables Generalizable Robot Manipulation