Robot learning holds tremendous promise to unlock the full potential of flexible, general, and dexterous robot systems, as well as to address some of the deepest questions in artificial intelligence. However, bringing robot learning to the level of generality required for effective real-world systems faces major obstacles in terms of data, generalization, and robustness. In this paper, we discuss how generalist robot policies (i.e., robot foundation models) can address these challenges, and how we can design effective generalist robot policies for complex and highly dexterous tasks. We propose a novel flow matching architecture built on top of a pre-trained vision-language model (VLM) to inherit Internet-scale semantic knowledge. We then discuss how this model can be trained on a large and diverse dataset from multiple dexterous robot platforms, including single-arm robots, dual-arm robots, and mobile manipulators. We evaluate our model in terms of its ability to perform tasks zero-shot after pre-training, to follow language instructions from people and from a high-level VLM policy, and to acquire new skills via fine-tuning. Our results cover a wide variety of tasks, such as laundry folding, table cleaning, and assembling boxes.

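This work focuses on the data, generalization, and robustness challenges facing robot learning, and explores how generalist robot policies (robot foundation models) can overcome these obstacles. It proposes a novel flow matching architecture built on a pre-trained vision-language model, capable of carrying out complex and dexterous tasks. The results show that the model can perform a variety of tasks zero-shot after pre-training and can acquire new skills through fine-tuning, which is significant for advancing general robot control.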

π₀: A Vision-Language-Action Flow Model for General Robot Control