Offline reinforcement learning (RL) is crucial for real-world applications where exploration can be costly or unsafe. However, offline learned policies are often suboptimal, and further online fine-tuning is required. In this paper, we tackle the fundamental dilemma of offline-to-online fine-tuning: if the agent remains pessimistic, it may fail to learn a better policy, while if it becomes optimistic directly, performance may suffer from a sudden drop. We show that Bayesian design principles are crucial in solving such a dilemma. Instead of adopting optimistic or pessimistic policies, the agent should act in a way that matches its belief in optimal policies. Such a probability-matching agent can avoid a sudden performance drop while still being guaranteed to find the optimal policy. Based on our theoretical findings, we introduce a novel algorithm that outperforms existing methods on various benchmarks, demonstrating the efficacy of our approach. Overall, the proposed approach provides a new perspective on offline-to-online RL that has the potential to enable more effective learning from offline data.

离线强化学习（RL）在探索可能成本高昂或不安全的真实世界应用中至关重要。然而，离线学习的策略通常是次优的，需要进一步进行在线微调。本文解决了离线到在线微调的基本困境：如果智能体保持悲观态度，可能无法学到更好的策略，而如果直接变得乐观，性能可能会突然下降。我们证明贝叶斯设计原则在解决这种困境中至关重要。智能体不应采取乐观或悲观的策略，而是应根据其对最优策略的信念采取行动。这样的概率匹配智能体可以避免性能突然下降，同时保证找到最优策略。基于我们的理论发现，我们提出了一种优于现有方法的新算法，在各种基准测试中展示了我们方法的有效性。总体而言，所提出的方法为离线到在线RL提供了一种新的视角，有潜力使离线数据的学习更加有效。

线下到线上强化学习的贝叶斯设计原则