The existing internet-scale image and video datasets cover a wide range of everyday objects and tasks, bringing the potential of learning policies that have broad generalization. Prior works have explored visual pre-training with different self-supervised objectives, but the generalization capabilities of the learned policies remain relatively unknown. In this work, we take the first step towards this challenge, focusing on how pre-trained representations can help the generalization of the learned policies. We first identify the key bottleneck in using a frozen pre-trained visual backbone for policy learning. We then propose SpawnNet, a novel two-stream architecture that learns to fuse pre-trained multi-layer representations into a separate network to learn a robust policy. Through extensive simulated and real experiments, we demonstrate significantly better categorical generalization compared to prior approaches in imitation learning settings.

本研究通过使用预训练表示来改善策略学习中的范畴化概括能力，提出了一种新的双流架构SpawnNet，通过将预训练的多层表示融合到另一个网络中学习鲁棒策略，实验证明了在模仿学习环境中相较以往方法具有显著更好的范畴化概括。

SpawnNet: 从预训练网络中学习通用可视动作技能