Offline reinforcement learning (RL) is crucial for real-world applications
where exploration can be costly or unsafe. However, policies learned offline
are often suboptimal and require further online fine-tuning. In this
paper, we tackle the fundamental dilemma of offline-to-online fine-tuning: if
the agent remains pessimistic, it may fail to learn a better policy, while if
it becomes optimistic directly, its performance may drop suddenly. We
show that Bayesian design principles are crucial to resolving this dilemma.
Instead of adopting optimistic or pessimistic policies, the agent should act in
a way that matches its belief about which policies are optimal.
Such a probability-matching agent can avoid a sudden performance drop while
still being guaranteed to find the optimal policy. Based on our theoretical
findings, we introduce a novel algorithm that outperforms existing methods on
various benchmarks, demonstrating the efficacy of our approach. Overall, the
proposed approach provides a new perspective on offline-to-online RL that has
the potential to enable more effective learning from offline data.
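
The contrast between pessimistic, optimistic, and probability-matching action selection described above can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes the agent's belief is represented by a handful of sampled Q-value vectors, and the function names and the `Q_SAMPLES` numbers are hypothetical, chosen only for illustration.

```python
import random
import statistics


def _bound(q_samples, action, sign):
    """Mean plus sign * std of one action's value across belief samples."""
    values = [q[action] for q in q_samples]
    return statistics.mean(values) + sign * statistics.pstdev(values)


def pessimistic_action(q_samples):
    """Act greedily on a lower confidence bound (mean - std)."""
    actions = range(len(q_samples[0]))
    return max(actions, key=lambda a: _bound(q_samples, a, -1.0))


def optimistic_action(q_samples):
    """Act greedily on an upper confidence bound (mean + std)."""
    actions = range(len(q_samples[0]))
    return max(actions, key=lambda a: _bound(q_samples, a, +1.0))


def probability_matching_action(q_samples, rng):
    """Draw one belief sample and act greedily under it (Thompson-style).

    Each action is chosen with the probability that the belief
    considers it optimal.
    """
    q = rng.choice(q_samples)
    return max(range(len(q)), key=lambda a: q[a])


# Hypothetical belief: four sampled Q-value vectors over three actions.
# Action 0 is well known; action 1 is uncertain and is the best action
# in one of the four samples.
Q_SAMPLES = [
    [1.0, 0.2, 0.5],
    [1.0, 1.8, 0.4],
    [1.0, 0.1, 0.6],
    [1.0, 0.3, 0.5],
]

if __name__ == "__main__":
    rng = random.Random(0)
    picks = [probability_matching_action(Q_SAMPLES, rng) for _ in range(10_000)]
    print("pessimistic choice:", pessimistic_action(Q_SAMPLES))
    print("optimistic choice:", optimistic_action(Q_SAMPLES))
    print("fraction of probability-matching picks on action 1:",
          picks.count(1) / len(picks))
```

In this toy belief the pessimistic rule never tries the uncertain action and the optimistic rule always does, while the probability-matching rule tries it only as often as the belief says it might be optimal.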

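Offline reinforcement learning (RL) is crucial in real-world applications where exploration may be costly or unsafe. However, policies learned offline are usually suboptimal and require further online fine-tuning. This paper addresses the fundamental dilemma of offline-to-online fine-tuning: if the agent stays pessimistic, it may fail to learn a better policy, whereas if it becomes optimistic directly, its performance may drop suddenly. We show that Bayesian design principles are crucial to resolving this dilemma. Rather than adopting an optimistic or pessimistic policy, the agent should act according to its belief about the optimal policy. Such a probability-matching agent can avoid a sudden performance drop while still being guaranteed to find the optimal policy. Based on our theoretical findings, we propose a new algorithm that outperforms existing methods, and we demonstrate the effectiveness of our approach on various benchmarks. Overall, the proposed approach offers a new perspective on offline-to-online RL and has the potential to make learning from offline data more effective.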

Bayesian Design Principles for Offline-to-Online Reinforcement Learning

To obtain a near-optimal policy with fewer interactions in Reinforcement
Learning (RL), a promising approach involves the combination of offline RL,
which enhances sample efficiency by leveraging offline datasets, and online RL,
which explores informative transitions by interacting with the environment.
Offline-to-Online (O2O) RL provides a paradigm for improving an offline-trained
agent within limited online interactions. However, due to the significant
distribution shift between online experiences and offline data, most offline RL
algorithms suffer from performance drops and fail to achieve stable policy
improvement in O2O adaptation. To address this problem, we propose the Robust
Offline-to-Online (RO2O) algorithm, designed to enhance offline policies
through uncertainty and smoothness, and to mitigate the performance drop in
online adaptation. Specifically, RO2O incorporates a Q-ensemble for the uncertainty
penalty and adversarial samples for policy and value smoothness, which enable
RO2O to maintain a consistent learning procedure in online adaptation without
requiring special changes to the learning objective. Theoretical analyses in
linear MDPs demonstrate that uncertainty and smoothness lead to a tighter
optimality bound in O2O under distribution shift. Experimental results
illustrate the superiority of RO2O in facilitating stable offline-to-online
learning and achieving significant improvement with limited online
interactions.
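
As a rough illustration of the two ingredients named above, the sketch below computes an ensemble-based uncertainty penalty and a smoothness term over perturbed states. It is a simplified sketch under stated assumptions, not the RO2O implementation: the state is a single float, the "adversarial" sample is found by a small grid search inside an epsilon-ball, and `penalized_q`, `smoothness_penalty`, `beta`, `epsilon`, and `n_grid` are all hypothetical names.

```python
import statistics


def penalized_q(q_values, beta=1.0):
    """Ensemble mean minus beta times the ensemble standard deviation.

    q_values holds one Q estimate per ensemble member for the same
    (state, action) pair; disagreement lowers the resulting value.
    """
    return statistics.mean(q_values) - beta * statistics.pstdev(q_values)


def smoothness_penalty(q_fn, state, action, epsilon=0.1, n_grid=5):
    """Largest squared change in Q over perturbed states near `state`.

    Assumption: a 1-D state and a grid search stand in for the
    adversarial sample; q_fn(state, action) returns a float.
    """
    base = q_fn(state, action)
    worst = 0.0
    for i in range(n_grid):
        # Offsets are spread evenly over [-epsilon, +epsilon].
        offset = epsilon * (2.0 * i / (n_grid - 1) - 1.0)
        diff = q_fn(state + offset, action) - base
        worst = max(worst, diff * diff)
    return worst


if __name__ == "__main__":
    # Members that agree incur no penalty; members that disagree do.
    print(penalized_q([1.0, 1.0, 1.0]))
    print(penalized_q([0.0, 2.0]))
    # A Q-function that changes quickly with the state is penalized more.
    print(smoothness_penalty(lambda s, a: 2.0 * s, state=0.5, action=0))
```

Both terms would normally be added to a learning objective; how they are weighted and combined is left out here because the abstract does not specify it.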

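Proposes the Robust Offline-to-Online (RO2O) algorithm, which strengthens offline policies through uncertainty and smoothness and reduces the performance drop during online adaptation; experimental results show its superiority in facilitating stable offline-to-online learning.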

Towards Robust Offline-to-Online Reinforcement Learning via Uncertainty and Smoothness

Offline reinforcement learning (RL) is a learning paradigm where an agent
learns from a fixed dataset of experience. However, learning solely from a
static dataset can limit performance due to the lack of exploration. To
overcome this limitation, offline-to-online RL combines offline pre-training with online
fine-tuning, which enables the agent to further refine its policy by
interacting with the environment in real time. Despite its benefits, existing
offline-to-online RL methods suffer from performance degradation and slow
improvement during the online phase. To tackle these challenges, we propose a
novel framework called Ensemble-based Offline-to-Online (E2O) RL. By increasing
the number of Q-networks, we seamlessly bridge offline pre-training and online
fine-tuning without degrading performance. Moreover, to expedite online
performance enhancement, we appropriately loosen the pessimism of Q-value
estimation and incorporate ensemble-based exploration mechanisms into our
framework. Experimental results demonstrate that E2O can substantially improve
the training stability, learning efficiency, and final performance of existing
offline RL methods during online fine-tuning on a range of locomotion and
navigation tasks, significantly outperforming existing offline-to-online RL
methods.
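
One way to picture "loosening the pessimism of Q-value estimation" with an ensemble is sketched below. The abstract does not say which target E2O uses, so this is only an assumed illustration: it compares the minimum over the whole ensemble with the minimum over a random subset, which can never be lower. The function names and `subset_size` are hypothetical.

```python
import random


def min_q_target(q_values):
    """Most pessimistic target: the minimum over every ensemble member."""
    return min(q_values)


def subset_min_q_target(q_values, subset_size, rng):
    """Looser target: the minimum over a random subset of the ensemble.

    Because the subset is drawn from the same estimates, the result is
    always greater than or equal to the full-ensemble minimum.
    """
    subset = rng.sample(q_values, subset_size)
    return min(subset)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical Q estimates from a ten-member ensemble for one pair.
    q_values = [rng.gauss(1.0, 0.5) for _ in range(10)]
    print("full-ensemble minimum:", min_q_target(q_values))
    print("random-pair minimum:", subset_min_q_target(q_values, 2, rng))
```

Shrinking `subset_size` makes the target less conservative, which is the kind of trade-off an online fine-tuning phase might tune once fresh environment data is available.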

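Proposes a new framework called Ensemble-based Offline-to-Online (E2O) RL. By increasing the number of Q-networks, it bridges offline pre-training and online fine-tuning without performance loss; by loosening the pessimism of Q-value estimation and making appropriate use of ensemble-based exploration mechanisms, it speeds up online performance improvement. It significantly outperforms existing offline-to-online RL methods and substantially improves the training stability, learning efficiency, and final performance of existing offline RL methods during online fine-tuning on a range of locomotion and navigation tasks.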