We present GR-2, a state-of-the-art generalist robot agent for versatile and generalizable robot manipulation. GR-2 is first pre-trained on a vast number of Internet videos to capture the dynamics of the world. This large-scale pre-training, involving 38 million video clips and over 50 billion tokens, equips GR-2 with the ability to generalize across a wide range of robotic tasks and environments during subsequent policy learning. Following this, GR-2 is fine-tuned for both video generation and action prediction using robot trajectories. It exhibits impressive multi-task learning capabilities, achieving an average success rate of 97.7% across more than 100 tasks. Moreover, GR-2 demonstrates exceptional generalization to new, previously unseen scenarios, including novel backgrounds, environments, objects, and tasks. Notably, GR-2 scales effectively with model size, underscoring its potential for continued growth and application. Project page: \url{https://gr2-manipulation.github.io}.

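This work presents GR-2, an advanced generalist robot agent designed to handle the variability of robot manipulation and adapt broadly across settings. After large-scale pre-training on 38 million video clips, GR-2 achieves an average success rate of 97.7% across more than 100 tasks and generalizes to new environments, demonstrating strong multi-task learning and generalization capabilities. The work makes an important contribution to the further development and practical application of robotics.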

GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation