Recent work has demonstrated both benefits and limitations of using
supervised approaches (without temporal-difference learning) for offline
reinforcement learning. While off-policy reinforcement learning provides a
promising approach for improving performance beyond supervised approaches, we
observe that training is often inefficient and unstable due to temporal
difference bootstrapping. In this paper, we propose a best-of-both approach by
first learning the behavior policy and critic with supervised learning, before
improving with off-policy reinforcement learning. Specifically, we demonstrate
improved efficiency by pre-training with a supervised Monte-Carlo value-error,
making use of commonly neglected downstream information from the provided
offline trajectories. We find that we are able to more than halve the training
time of the considered offline algorithms on standard benchmarks, and
surprisingly also achieve greater stability. We further build on the importance
of having consistent policy and value functions to propose novel hybrid
algorithms, TD3+BC+CQL and EDAC+BC, that regularize both the actor and the
critic towards the behavior policy. This helps to more reliably improve on the
behavior policy when learning from limited human demonstrations. Code is
available at this https URL
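
The supervised pre-training step described above can be illustrated with a small sketch. The abstract does not give the exact loss, so the following is only one plausible reading: the critic is regressed onto the discounted return-to-go computed from each offline trajectory, with no bootstrapping. The function names, the discount factor `gamma`, and the assumption that every trajectory runs to termination (no truncation handling) are illustrative choices, not taken from the paper.

```python
import numpy as np


def discounted_returns_to_go(rewards, gamma=0.99):
    """Monte-Carlo return G_t = r_t + gamma * G_{t+1} for one trajectory.

    Assumes the trajectory ends in a terminal state, so G after the last
    step is zero (an assumption for this sketch).
    """
    returns = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    # Sweep backwards so each step reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def monte_carlo_value_error(q_predictions, returns):
    """Mean squared error between critic outputs and Monte-Carlo returns."""
    q_predictions = np.asarray(q_predictions, dtype=np.float64)
    returns = np.asarray(returns, dtype=np.float64)
    return float(np.mean((q_predictions - returns) ** 2))


# Toy trajectory with three rewards and a hypothetical critic output per step.
rewards = [1.0, 0.0, 2.0]
targets = discounted_returns_to_go(rewards, gamma=0.5)
loss = monte_carlo_value_error([1.5, 1.0, 1.0], targets)
```

Because the targets are fixed quantities computed once from the dataset, this regression is an ordinary supervised problem, which is consistent with the abstract's claim that it avoids the instability of temporal-difference bootstrapping during pre-training.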
