In this paper, we leverage self-supervised vision transformer models and their emergent semantic abilities to improve the generalization of imitation learning policies. We introduce BC-ViT, an imitation learning algorithm that uses rich DINO pre-trained Vision Transformer (ViT) patch-level embeddings to generalize better when learning from demonstrations. Our learner sees the world by clustering appearance features into semantic concepts, forming stable keypoints that generalize across a wide range of appearance variations and object types. We show that this representation enables generalized behaviour by evaluating imitation learning across a diverse dataset of object manipulation tasks. Our method, data and evaluation approach are made available to facilitate further study of generalization in imitation learners.
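
To make the idea of clustering patch-level features into keypoints concrete, the following is a minimal sketch and not the BC-ViT implementation. It assumes plain k-means as the clustering step and defines a keypoint as the mean grid position of the patches in a cluster; both choices, along with the hypothetical helper names `kmeans` and `cluster_keypoints`, are illustrative assumptions. The features are synthetic stand-ins for DINO ViT patch embeddings.

```python
import numpy as np


def kmeans(features, k, iters=20, seed=0):
    """Plain Lloyd's k-means (an assumed clustering choice). Returns labels."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct feature vectors.
    centroids = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(iters):
        # Squared Euclidean distance from every feature to every centroid.
        dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = features[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    # Final assignment against the updated centroids.
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)


def cluster_keypoints(patch_features, grid_hw, k):
    """Map (H*W, D) patch features to k keypoints as (row, col) positions.

    Each keypoint is the mean grid position of the patches assigned to
    one cluster (an assumed definition); empty clusters yield NaN.
    """
    height, width = grid_hw
    rows, cols = np.divmod(np.arange(height * width), width)
    coords = np.stack([rows, cols], axis=1).astype(float)
    labels = kmeans(patch_features, k)
    keypoints = np.full((k, 2), np.nan)
    for j in range(k):
        mask = labels == j
        if mask.any():
            keypoints[j] = coords[mask].mean(axis=0)
    return keypoints


# Synthetic stand-in for patch embeddings: an 8x8 grid of 16-D features
# where the left and right halves of the image belong to two "concepts".
H, W, D = 8, 8, 16
rng = np.random.default_rng(0)
is_right = (np.arange(H * W) % W) >= W // 2
features = np.where(is_right[:, None], -5.0, 5.0) + 0.1 * rng.normal(size=(H * W, D))

keypoints = cluster_keypoints(features, (H, W), k=2)
print(keypoints)  # one keypoint centred on each half of the grid
```

In this toy setting the keypoints depend only on which patches share a cluster, not on the raw feature values, which gives some intuition for why such a representation could be stable under appearance changes.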

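By leveraging self-supervised vision transformer models and their emergent semantic abilities, we cluster appearance features to form stable keypoints and thereby improve the generalization of imitation learning policies. This paper introduces BC-ViT, an imitation learning algorithm that uses rich DINO pre-trained Vision Transformer (ViT) patch-level embeddings to generalize better when learning from demonstrations. Evaluating imitation learning on a diverse dataset of object manipulation tasks shows that this representation enables generalized behaviour. To facilitate further study of generalization in imitation learning, we make our method, data and evaluation approach available.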

Generalizable Imitation Learning Based on Pre-Trained Representations