Enabling robotic manipulation that generalizes to out-of-distribution scenes is a crucial step toward open-world embodied intelligence. For human beings, this ability is rooted in an understanding of semantic correspondence among objects, which naturally transfers interaction experience with familiar objects to novel ones. Although robots lack such a reservoir of interaction experience, the vast supply of human videos on the Internet can serve as a valuable resource, from which we extract an affordance memory that includes contact points. Inspired by the natural way humans think, we propose Robo-ABC: when confronted with an unfamiliar object that requires generalization, the robot acquires affordance by retrieving objects that share visual or semantic similarities from the affordance memory. The next step is to map the contact points of the retrieved objects onto the new object. While establishing this correspondence may seem formidable at first glance, recent research finds that it arises naturally from pre-trained diffusion models, enabling affordance mapping even across disparate object categories. Through the Robo-ABC framework, robots can generalize to manipulate out-of-category objects in a zero-shot manner without any manual annotation, additional training, part segmentation, pre-coded knowledge, or viewpoint restrictions. Quantitatively, Robo-ABC improves the accuracy of visual affordance retrieval by 31.6% compared to state-of-the-art (SOTA) end-to-end affordance models. We also conduct real-world experiments on cross-category object-grasping tasks, in which Robo-ABC achieves a success rate of 85.7%, demonstrating its capacity for real-world tasks.
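The two steps described above, retrieval from the affordance memory followed by contact-point mapping, can be illustrated with a minimal sketch. It is not the authors' implementation: the memory layout, the function names (`retrieve`, `map_contact_point`), the cosine-similarity criterion, and the toy two-dimensional features are all assumptions made for illustration. In the framework itself, the per-pixel features would come from a pre-trained diffusion model, which this sketch does not attempt to reproduce.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(memory, query_embedding):
    """Return the memory entry whose embedding is most similar to the query."""
    return max(memory, key=lambda e: cosine(e["embedding"], query_embedding))

def map_contact_point(source_features, contact, target_features):
    """Transfer a contact point by nearest-neighbour matching in feature space.

    Feature maps are nested lists indexed [row][col] -> feature vector.
    """
    r, c = contact
    anchor = source_features[r][c]  # feature at the known contact point
    best, best_score = None, -2.0   # cosine similarity is always >= -1
    for i, row in enumerate(target_features):
        for j, feat in enumerate(row):
            score = cosine(anchor, feat)
            if score > best_score:
                best, best_score = (i, j), score
    return best

# Hypothetical affordance memory with toy embeddings and feature maps.
memory = [
    {"name": "mug", "embedding": [1.0, 0.0],
     "features": [[[1.0, 0.0], [0.0, 1.0]]], "contact": (0, 1)},
    {"name": "knife", "embedding": [0.0, 1.0],
     "features": [[[1.0, 1.0], [1.0, -1.0]]], "contact": (0, 0)},
]

# Toy embedding and feature map of the unfamiliar object.
query_embedding = [0.9, 0.1]
target_features = [[[0.1, 0.9], [1.0, 0.1]],
                   [[0.5, 0.5], [-1.0, 0.0]]]

match = retrieve(memory, query_embedding)
point = map_contact_point(match["features"], match["contact"], target_features)
print(match["name"], point)
```

Here the query retrieves the `mug` entry, and its contact point is transferred to the target pixel whose feature is closest to the source contact feature. A real system would replace the exhaustive pixel loop with a batched similarity computation over dense feature maps.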

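By extracting contact points from human videos and drawing on the way humans think, we propose the Robo-ABC framework. Without any manual annotation, additional training, part segmentation, pre-coded knowledge, or viewpoint restrictions, it lets a robot acquire affordance information by retrieving visually or semantically similar objects and mapping that affordance onto a new object, enabling zero-shot manipulation of out-of-category objects. On visual affordance retrieval, Robo-ABC improves on state-of-the-art end-to-end affordance models by 31.6%, and in real-world object-grasping experiments it achieves a success rate of 85.7%, demonstrating its capability on real-world tasks.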

Robo-ABC: Affordance Generalization Beyond Categories via Semantic Correspondence for Robot Manipulation