Offline reinforcement learning (RL) is a learning paradigm where an agent
learns from a fixed dataset of experience. However, learning solely from a
static dataset can limit performance because the agent cannot explore. To
overcome this limitation, offline-to-online RL combines offline pre-training
with online fine-tuning, which enables the agent to further refine its policy
by interacting with the environment in real time. Despite its benefits, existing
offline-to-online RL methods suffer from performance degradation and slow
improvement during the online phase. To tackle these challenges, we propose a
novel framework called Ensemble-based Offline-to-Online (E2O) RL. By increasing
the number of Q-networks, we seamlessly bridge offline pre-training and online
fine-tuning without degrading performance. Moreover, to expedite online
performance enhancement, we appropriately loosen the pessimism of Q-value
estimation and incorporate ensemble-based exploration mechanisms into our
framework. Experimental results demonstrate that E2O can substantially improve
the training stability, learning efficiency, and final performance of existing
offline RL methods during online fine-tuning on a range of locomotion and
navigation tasks, significantly outperforming existing offline-to-online RL
methods.

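Summary: This paper proposes a new framework called Ensemble-based Offline-to-Online (E2O) RL. By increasing the number of Q-networks, it bridges offline pre-training and online fine-tuning without performance loss. By relaxing the pessimism of Q-value estimation and making appropriate use of ensemble-based exploration mechanisms, it accelerates online performance improvement. It significantly outperforms existing offline-to-online RL methods and substantially improves the training stability, learning efficiency, and final performance of existing offline RL methods during online fine-tuning on a range of locomotion and navigation tasks.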
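
The ensemble ideas mentioned above can be illustrated with a minimal NumPy sketch. The abstract does not specify which target rule or exploration rule E2O uses, so everything here is an assumption made for illustration: the function names, the mixing weight `w`, the exploration coefficient `bonus`, and the choice of a min/mean blend and an upper-confidence-bound score are hypothetical and are not taken from the paper.

```python
import numpy as np


def pessimistic_target(q_values):
    """Most pessimistic estimate: minimum over the ensemble.

    q_values has shape (N, B): N Q-networks, B state-action pairs.
    """
    return q_values.min(axis=0)


def loosened_target(q_values, w=0.5):
    """Hypothetical 'loosened' estimate: blend of ensemble min and mean.

    w = 0 recovers the minimum; w = 1 uses the ensemble mean.
    For w in [0, 1] the result is never below the minimum.
    """
    q_min = q_values.min(axis=0)
    q_mean = q_values.mean(axis=0)
    return (1.0 - w) * q_min + w * q_mean


def ucb_action(q_candidates, bonus=1.0):
    """Hypothetical ensemble-based exploration over K candidate actions.

    q_candidates has shape (N, K). The score adds ensemble disagreement
    (standard deviation) to the mean, favouring uncertain actions.
    """
    score = q_candidates.mean(axis=0) + bonus * q_candidates.std(axis=0)
    return int(np.argmax(score))


# Toy ensemble: 3 Q-networks, 2 entries each.
q = np.array([[1.0, 4.0],
              [3.0, 2.0],
              [2.0, 6.0]])

print(pessimistic_target(q))      # minimum per column
print(loosened_target(q, w=0.5))  # halfway between min and mean
print(ucb_action(q, bonus=1.0))   # index of the highest optimistic score
```

In this sketch, raising `w` makes the value estimate less conservative, and raising `bonus` makes action selection favour candidates on which the ensemble members disagree.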