Offline-to-online Reinforcement Learning (O2O RL) aims to improve the
performance of an offline-pretrained policy using only a few online samples.
Built on offline RL algorithms, most O2O methods focus on the balance between
the RL objective and pessimism, or on the utilization of offline and online
samples. In
this paper, from a novel perspective, we systematically study the challenges
that remain in O2O RL and identify that the reason behind the slow improvement
of the performance and the instability of online finetuning lies in the
inaccurate Q-value estimation inherited from offline pretraining. Specifically,
we demonstrate that estimation bias and inaccurate ranking of Q-values send a
misleading signal to the policy update, making standard offline RL
algorithms, such as CQL and TD3-BC, ineffective during online finetuning. Based
on this observation, we address the problem of Q-value estimation with two
techniques: (1) perturbed value update and (2) increased frequency of Q-value
updates. The first technique smooths out biased Q-value estimation with sharp
peaks, preventing early-stage policy exploitation of sub-optimal actions. The
second one alleviates the estimation bias inherited from offline pretraining by
accelerating learning. Extensive experiments on the MuJoCo and Adroit
environments demonstrate that the proposed method, named SO2, significantly
alleviates Q-value estimation issues and consistently improves performance
over state-of-the-art methods by up to 83.1%.
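
The two techniques named above can be illustrated with a minimal sketch. It is
not the SO2 implementation: the function names (`perturbed_td_target`,
`finetune_step`), the clipped Gaussian noise on the next action, and every
hyperparameter value are assumptions chosen for illustration, since the
abstract does not specify them.

```python
import numpy as np


def perturbed_td_target(q_target_fn, policy_fn, reward, next_state, done,
                        gamma=0.99, noise_std=0.3, noise_clip=0.6,
                        action_low=-1.0, action_high=1.0, rng=None):
    """TD target evaluated at a noise-perturbed next action.

    Assumption: the perturbation is clipped Gaussian noise added to the
    policy's next action, so the target averages over a neighbourhood of
    actions instead of trusting a single, possibly sharp, Q-value peak.
    """
    rng = np.random.default_rng() if rng is None else rng
    next_action = policy_fn(next_state)
    noise = rng.normal(0.0, noise_std, size=next_action.shape)
    noise = np.clip(noise, -noise_clip, noise_clip)
    perturbed = np.clip(next_action + noise, action_low, action_high)
    # Terminal transitions (done == 1) do not bootstrap.
    return reward + gamma * (1.0 - done) * q_target_fn(next_state, perturbed)


def finetune_step(collect_fn, update_fn, updates_per_env_step=10):
    """One online step: collect one transition, then run several Q updates.

    `collect_fn` and `update_fn` are hypothetical callbacks; a value of
    `updates_per_env_step` above 1 stands in for the increased update
    frequency.
    """
    collect_fn()
    for _ in range(updates_per_env_step):
        update_fn()
```

With `noise_std=0.0` the first function reduces to the ordinary bootstrapped
target, which makes it easy to compare the two variants on the same batch.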

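Offline-to-online reinforcement learning (O2O RL) aims to improve the
performance of an offline-pretrained policy using a small number of online
samples. This paper systematically studies, from a novel perspective, the
challenges that remain in O2O RL and identifies inaccurate Q-value estimation
inherited from offline pretraining as the reason for slow performance
improvement and unstable online finetuning. To address this problem, we adopt
two techniques: perturbed value update and increased frequency of Q-value
updates. Our experiments show that the proposed method, SO2, significantly
alleviates Q-value estimation issues and improves performance over
state-of-the-art methods by up to 83.1%.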

A Perspective of Q-value Estimation on Offline-to-Online Reinforcement Learning

Divergence of Q-value estimation has been a prominent issue in
offline RL, where the agent has no access to real dynamics. Traditional beliefs
attribute this instability to querying out-of-distribution actions when
bootstrapping value targets. Though this issue can be alleviated with policy
constraints or conservative Q estimation, a theoretical understanding of the
underlying mechanism causing the divergence has been absent. In this work, we
aim to thoroughly comprehend this mechanism and attain an improved solution. We
first identify a fundamental pattern, self-excitation, as the primary cause of
Q-value estimation divergence in offline RL. Then, we propose a novel
Self-Excite Eigenvalue Measure (SEEM) metric based on Neural Tangent Kernel
(NTK) to measure the evolving property of the Q-network during training, which
provides an intriguing explanation of the emergence of divergence. For the
first time, our theory can reliably decide at an early stage whether training
will diverge, and even predict the order of growth of the estimated Q-value,
the model's norm, and the crashing step when an SGD optimizer is used. The
experiments demonstrate perfect alignment with this theoretical analysis.
Building on our insights, we propose to resolve divergence from a novel
perspective, namely improving the model's architecture for better extrapolating
behavior. Through extensive empirical studies, we identify LayerNorm as a good
solution to effectively avoid divergence without introducing detrimental bias,
leading to superior performance. Experimental results show that it still
works in some of the most challenging settings, i.e. using only 1% of the
dataset's transitions, where all previous methods fail. Moreover, it can be
easily plugged
into modern offline RL methods and achieve SOTA results on many challenging
tasks. We also give unique insights into its effectiveness.
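
As a rough illustration of the architectural change described above, the
sketch below places LayerNorm after each hidden linear layer of a small
Q-network. It is a NumPy toy under stated assumptions, not the authors' code:
the helper names (`layer_norm`, `init_q_params`, `q_forward`), the layer sizes,
the placement of the normalization, and the omission of learnable scale and
shift parameters are all illustrative choices.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize the last axis to zero mean and (at most) unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def init_q_params(state_dim, action_dim, hidden=64, seed=0):
    """Random weights for a two-hidden-layer Q-network (hypothetical sizes)."""
    rng = np.random.default_rng(seed)
    in_dim = state_dim + action_dim
    return {
        "w1": rng.normal(0.0, 1.0 / np.sqrt(in_dim), (in_dim, hidden)),
        "b1": np.zeros(hidden),
        "w2": rng.normal(0.0, 1.0 / np.sqrt(hidden), (hidden, hidden)),
        "b2": np.zeros(hidden),
        "w3": rng.normal(0.0, 1.0 / np.sqrt(hidden), (hidden, 1)),
        "b3": np.zeros(1),
    }


def q_forward(params, state, action, use_layer_norm=True):
    """Q(s, a) for a batch; LayerNorm is applied before each ReLU."""
    h = np.concatenate([state, action], axis=-1)
    for w, b in (("w1", "b1"), ("w2", "b2")):
        h = h @ params[w] + params[b]
        if use_layer_norm:
            h = layer_norm(h)
        h = np.maximum(h, 0.0)  # ReLU
    return (h @ params["w3"] + params["b3"]).squeeze(-1)
```

In this toy, the last hidden activation has norm at most `sqrt(hidden)` once
LayerNorm is on, so the output stays bounded however far the input lies from
the data. Without it, the bias-free ReLU network scales linearly with the
input. This only demonstrates a boundedness property of the sketch; it does not
reproduce the paper's analysis.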

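In offline reinforcement learning, divergence of Q-value estimation has long
been a prominent problem. Through a thorough understanding of the underlying
mechanism and improvements to the model architecture, this work proposes a new
way to resolve divergence, which includes identifying the self-excitation
pattern in offline RL and improving performance with a LayerNorm architecture.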

Understanding, Predicting and Better Resolving Q-Value Divergence in Offline-RL

Research on deep reinforcement learning, which estimates Q-values with deep
learning, has recently attracted the interest of researchers. In deep
reinforcement learning, it is important to learn efficiently from the
experiences that an agent has collected by exploring the environment. We
propose NEC2DQN, which improves the learning speed of a sample-inefficient
algorithm such as DQN by using a sample-efficient one such as NEC at the
beginning of learning. We show that it learns faster than Double DQN or N-step
DQN in experiments on Pong.

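NEC2DQN aims to improve the learning speed of inefficient deep reinforcement
learning algorithms such as DQN. By using the efficient NEC algorithm at the
beginning of learning, it learns faster than Double DQN or N-step DQN in
experiments on Pong.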