Building a robot that can understand and learn to interact by watching humans has inspired several vision problems. However, despite some successful results on static datasets, it remains unclear how current models can be used directly on a robot. In this paper, we aim to bridge this gap by leveraging videos of human interactions in an environment-centric manner. Using internet videos of human behavior, we train a visual affordance model that estimates where and how in the scene a human is likely to interact. The structure of these behavioral affordances directly enables the robot to perform many complex tasks. We show how to seamlessly integrate our affordance model with four robot learning paradigms: offline imitation learning, exploration, goal-conditioned learning, and action parameterization for reinforcement learning. We show the efficacy of our approach, which we call VRB, across 4 real-world environments, over 10 different tasks, and 2 robotic platforms operating in the wild. Results, visualizations, and videos are available at https://robo-affordances.github.io/

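This paper explores how to train a visual affordance model from human behavior in internet videos so that robots can carry out complex tasks in real-world environments. We seamlessly connect the model with four robot learning paradigms and demonstrate its efficacy across 4 real-world environments, over 10 different tasks, and 2 robotic platforms.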

Affordances from Human Videos as a Versatile Representation for Robotics