Pre-training for Reinforcement Learning (RL) with purely video data is a valuable yet challenging problem. Although in-the-wild videos are readily available and contain a vast amount of prior world knowledge, the absence of action annotations and the common domain gap with downstream tasks hinder the use of videos for RL pre-training. To address this challenge, we propose Pre-trained Visual Dynamics Representations (PVDR), which bridge the domain gap between videos and downstream tasks for efficient policy learning. Adopting video prediction as the pre-training task, we use a Transformer-based Conditional Variational Autoencoder (CVAE) to learn visual dynamics representations. The pre-trained representations capture the prior knowledge of visual dynamics in the videos. This abstract prior knowledge can be readily adapted to downstream tasks and aligned with executable actions through online adaptation. We conduct experiments on a series of robotics visual control tasks and verify that PVDR is an effective form of video pre-training that promotes policy learning.
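
For orientation, the video-prediction pre-training task can be read as a conditional variational objective. The following is a generic CVAE evidence lower bound, written under assumed notation that does not come from the abstract: $x_{1:t}$ are observed context frames, $x_{t+1:T}$ are the future frames to predict, $z$ is the latent visual dynamics representation, $q_\phi$ is an encoder, $p_\theta$ a decoder, and $p_\psi$ a learned conditional prior. It is a sketch of the standard form and not necessarily the exact loss used by PVDR.

```latex
% Generic conditional ELBO for video prediction (assumed notation, not the paper's).
\begin{equation}
\log p_\theta(x_{t+1:T} \mid x_{1:t})
\;\ge\;
\mathbb{E}_{q_\phi(z \mid x_{1:T})}
  \bigl[ \log p_\theta(x_{t+1:T} \mid x_{1:t}, z) \bigr]
\;-\;
D_{\mathrm{KL}}\bigl( q_\phi(z \mid x_{1:T}) \,\big\|\, p_\psi(z \mid x_{1:t}) \bigr)
\end{equation}
```

Under this reading, the first term rewards accurate prediction of future frames given the context and the latent, while the KL term keeps the latent inferred from the full clip close to what can be predicted from the context alone.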

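This work addresses the challenges of pre-training for reinforcement learning with unlabeled video data and proposes a new method, Pre-trained Visual Dynamics Representations (PVDR). Adopting video prediction as the pre-training task, we use a Transformer-based Conditional Variational Autoencoder (CVAE) to learn visual dynamics representations from videos, which narrows the domain gap between videos and downstream tasks and improves the efficiency of policy learning. Experimental results show that PVDR is effective for video-based pre-training.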

Pre-trained Visual Dynamics Representations for Efficient Policy Learning