Offline-to-online Reinforcement Learning (O2O RL) aims to improve the performance of an offline-pretrained policy using only a few online samples. Built on offline RL algorithms, most O2O methods focus on the balance between the RL objective and pessimism, or on how offline and online samples are used. In this paper, we take a different perspective: we systematically study the challenges that remain in O2O RL and identify that the slow performance improvement and the instability of online finetuning both stem from inaccurate Q-value estimation inherited from offline pretraining. Specifically, we demonstrate that estimation bias and inaccurate ranking of Q-values send a misleading signal to the policy update, making standard offline RL algorithms such as CQL and TD3-BC ineffective during online finetuning. Based on this observation, we address the problem of Q-value estimation with two techniques: (1) perturbed value update and (2) increased frequency of Q-value updates. The first technique smooths out biased Q-value estimates with sharp peaks, preventing the policy from exploiting sub-optimal actions in the early stage of finetuning. The second alleviates the estimation bias inherited from offline pretraining by accelerating learning. Extensive experiments on the MuJoCo and Adroit environments demonstrate that the proposed method, named SO2, significantly alleviates Q-value estimation issues and consistently improves performance over state-of-the-art methods by up to 83.1%.
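The two techniques can be illustrated with a minimal sketch. This is not the SO2 implementation: the abstract does not say how the perturbation is applied, so the sketch assumes one plausible reading, clipped Gaussian noise added to the target action when the TD target is computed. The function names (`perturbed_td_target`, `finetune_step`), the callables passed in, and all hyperparameter values (`sigma`, `clip`, `n_q_updates`) are hypothetical.

```python
import numpy as np

def perturbed_td_target(q_target, policy, next_obs, reward, done, rng,
                        gamma=0.99, sigma=0.3, clip=0.6, act_limit=1.0):
    """TD target evaluated at a noise-perturbed next action.

    Assumption: the perturbation is clipped Gaussian noise on the target
    action, which averages Q over a neighbourhood and flattens sharp peaks.
    """
    next_act = policy(next_obs)
    noise = np.clip(rng.normal(0.0, sigma, size=next_act.shape), -clip, clip)
    next_act = np.clip(next_act + noise, -act_limit, act_limit)
    return reward + gamma * (1.0 - done) * q_target(next_obs, next_act)

def finetune_step(sample_batch, update_q, update_policy, n_q_updates=10):
    """Per environment step: several Q-updates, then one policy update.

    Assumption: "increased frequency" means more critic updates per
    collected transition; n_q_updates=10 is an illustrative value.
    """
    for _ in range(n_q_updates):
        update_q(sample_batch())
    update_policy(sample_batch())
```

In this reading, `sigma` controls how strongly peaks in the Q-function are smoothed, and `n_q_updates` controls how quickly the critic can correct the bias it carried over from pretraining before the policy acts on it.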

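Offline-to-online reinforcement learning (O2O RL) aims to improve the performance of an offline-pretrained policy using a small number of online samples. This paper systematically studies, from a novel perspective, the challenges that remain in O2O RL, and identifies inaccurate Q-value estimation from offline pretraining as the cause of slow performance improvement and unstable online finetuning. To address this problem, we adopt two techniques: perturbed value update and increased frequency of Q-value updates. Our experiments show that the proposed method, SO2, significantly alleviates Q-value estimation issues and improves performance over state-of-the-art methods by up to 83.1%.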

A Perspective of Q-value Estimation on Offline-to-Online Reinforcement Learning