Human demonstration videos are a widely available data source for robot learning and an intuitive user interface for expressing desired behavior. However, directly extracting reusable robot manipulation skills from unstructured human videos is challenging due to the large embodiment difference and unobserved action parameters. To bridge this embodiment gap, this paper introduces XSkill, an imitation learning framework that 1) discovers a cross-embodiment representation called skill prototypes purely from unlabeled human and robot manipulation videos, 2) transfers the skill representation to robot actions using a conditional diffusion policy, and finally, 3) composes the learned skills to accomplish unseen tasks specified by a human prompt video. Our experiments in simulation and real-world environments show that the discovered skill prototypes facilitate both skill transfer and composition for unseen tasks, resulting in a more general and scalable imitation learning framework. The performance of XSkill is best understood from the anonymous website: https://xskillcorl.github.io.
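
To make the idea of composing skills from a human prompt video more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that video frames have already been embedded into a shared space and that a small set of skill prototypes is given; the nearest-prototype assignment by cosine similarity, the function names `assign_prototypes` and `skill_plan`, and the toy 2-D vectors are all illustrative assumptions. The conditional diffusion policy that would turn each skill into robot actions is omitted.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def assign_prototypes(frame_embeddings, prototypes):
    """Return, for each frame, the index of the most similar prototype.

    Similarity is cosine similarity (an assumption made for this sketch).
    frame_embeddings: array of shape (T, D); prototypes: array of shape (K, D).
    """
    sims = l2_normalize(frame_embeddings) @ l2_normalize(prototypes).T
    return sims.argmax(axis=1)


def skill_plan(assignments):
    """Collapse consecutive repeated prototype indices into an ordered plan."""
    plan = []
    for k in assignments.tolist():
        if not plan or plan[-1] != k:
            plan.append(k)
    return plan


# Hypothetical toy data: three prototypes and a four-frame human prompt video.
prototypes = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
human_prompt = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [-0.7, 0.1]])

plan = skill_plan(assign_prototypes(human_prompt, prototypes))
print(plan)  # [0, 1, 2]
```

In this sketch the resulting plan is an ordered list of prototype indices. Under the framework described above, a skill-conditioned policy would then be conditioned on each entry in turn to execute the task on the robot.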

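This paper introduces XSkill, an imitation learning framework that addresses the challenge of extracting reusable robot manipulation skills directly from unstructured human videos. It discovers a cross-embodiment representation from unlabeled human and robot manipulation videos, transfers that representation to robot actions using a conditional diffusion policy, and composes the learned skills to complete unseen tasks specified by a human prompt video. Experiments in simulation and real-world environments show that the discovered skill prototypes facilitate skill transfer and composition for unseen tasks.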

XSkill: Cross Embodiment Skill Discovery