Recently, it has been shown that for offline deep reinforcement learning
(DRL), pre-training Decision Transformer with a large language corpus can
improve downstream performance (Reid et al., 2022). A natural question is
whether this performance gain requires language pre-training, or whether it can
also be achieved with simpler pre-training schemes that do not involve
language. In this paper, we first show that language is not essential
for improved performance, and indeed pre-training with synthetic IID data for a
small number of updates can match the performance gains from pre-training with
a large language corpus; moreover, pre-training with data generated by a
one-step Markov chain can further improve the performance. Inspired by these
experimental results, we then consider pre-training Conservative Q-Learning
(CQL), a popular offline DRL algorithm, which is Q-learning-based and typically
employs a Multi-Layer Perceptron (MLP) backbone. Surprisingly, pre-training
with simple synthetic data for a small number of updates can also improve CQL,
providing consistent performance improvement on D4RL Gym locomotion datasets.
The results of this paper not only illustrate the importance of pre-training
for offline DRL but also show that the pre-training data can be synthetic and
generated with remarkably simple mechanisms.
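
To make the two synthetic data sources concrete, the following is a minimal sketch of how IID token sequences and one-step Markov chain token sequences might be generated. It is an illustration under stated assumptions, not the paper's implementation: the function names, the vocabulary size, the sequence length, and the use of a softmax over random Gaussian logits to build the transition matrix are all hypothetical choices made for this example.

```python
import numpy as np


def sample_iid_tokens(num_seqs, seq_len, vocab_size, rng):
    """Draw every token independently and uniformly from the vocabulary."""
    return rng.integers(0, vocab_size, size=(num_seqs, seq_len))


def random_transition_matrix(vocab_size, rng, temperature=1.0):
    """Build a row-stochastic matrix from random logits.

    Assumption for illustration: rows are a softmax over Gaussian logits.
    """
    logits = rng.normal(size=(vocab_size, vocab_size)) / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)


def sample_markov_tokens(num_seqs, seq_len, transition, rng):
    """Sample sequences where each token depends only on the previous one."""
    vocab_size = transition.shape[0]
    seqs = np.empty((num_seqs, seq_len), dtype=np.int64)
    # The first token has no predecessor, so draw it uniformly.
    seqs[:, 0] = rng.integers(0, vocab_size, size=num_seqs)
    cdf = np.cumsum(transition, axis=1)
    for t in range(1, seq_len):
        u = rng.random(num_seqs)
        # Inverse-CDF sampling from the row selected by the previous token.
        nxt = (u[:, None] > cdf[seqs[:, t - 1]]).sum(axis=1)
        # Guard against round-off when a row's cumulative sum is just below 1.
        seqs[:, t] = np.minimum(nxt, vocab_size - 1)
    return seqs


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vocab_size, seq_len = 100, 32  # hypothetical sizes
    iid_batch = sample_iid_tokens(8, seq_len, vocab_size, rng)
    transition = random_transition_matrix(vocab_size, rng)
    markov_batch = sample_markov_tokens(8, seq_len, transition, rng)
    print(iid_batch.shape, markov_batch.shape)
```

Either batch could then be passed to a pre-training loop in place of language tokens. The only difference between the two sources is whether the next token is conditioned on the current one.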
