To obtain a near-optimal policy with fewer interactions in Reinforcement
Learning (RL), a promising approach involves the combination of offline RL,
which enhances sample efficiency by leveraging offline datasets, and online RL,
which explores informative transitions by interacting with the environment.
Offline-to-Online (O2O) RL provides a paradigm for improving an offline-trained
agent within limited online interactions. However, due to the significant
distribution shift between online experiences and offline data, most offline RL
algorithms suffer from performance drops and fail to achieve stable policy
improvement in O2O adaptation. To address this problem, we propose the Robust
Offline-to-Online (RO2O) algorithm, designed to enhance offline policies
through uncertainty and smoothness, and to mitigate the performance drop in
online adaptation. Specifically, RO2O incorporates a Q-ensemble for an uncertainty
penalty and adversarial samples for policy and value smoothness, which enable
RO2O to maintain a consistent learning procedure in online adaptation without
requiring special changes to the learning objective. Theoretical analyses in
linear MDPs demonstrate that the uncertainty and smoothness lead to a tighter
optimality bound in O2O against distribution shift. Experimental results
illustrate the superiority of RO2O in facilitating stable offline-to-online
learning and achieving significant improvement with limited online
interactions.
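
To make the two ingredients named above concrete, the following is a minimal sketch, not the authors' implementation. It assumes the uncertainty penalty takes a lower-confidence-bound form (ensemble mean minus a weighted ensemble standard deviation) and that smoothness is measured as the worst-case change in a Q-value under a bounded state perturbation. The function names, the `beta` and `epsilon` parameters, the scalar state, and the grid search used in place of a gradient-based adversarial attack are all illustrative assumptions.

```python
import statistics


def penalized_q(q_values, beta=1.0):
    """Pessimistic Q estimate from an ensemble (assumed LCB form).

    q_values: Q(s, a) estimates, one per ensemble member (hypothetical input).
    beta: weight of the uncertainty penalty (hypothetical hyperparameter).
    """
    mean_q = statistics.fmean(q_values)
    # Disagreement across ensemble members serves as the uncertainty proxy.
    uncertainty = statistics.pstdev(q_values)
    return mean_q - beta * uncertainty


def smoothness_gap(q_fn, state, action, epsilon=0.1, n_grid=5):
    """Worst-case change in Q over perturbed states within [-epsilon, +epsilon].

    A coarse grid search over a scalar state stands in for an adversarial
    attack; q_fn is any callable (state, action) -> float.
    """
    base = q_fn(state, action)
    # Evenly spaced perturbations covering the whole epsilon interval.
    deltas = [-epsilon + 2 * epsilon * i / (n_grid - 1) for i in range(n_grid)]
    return max(abs(q_fn(state + d, action) - base) for d in deltas)


if __name__ == "__main__":
    # Members that disagree yield a lower (more pessimistic) estimate.
    print(penalized_q([1.0, 1.0, 1.0]))
    print(penalized_q([0.0, 2.0], beta=1.0))
    # A steep Q-function shows a large gap, which a smoothness loss would shrink.
    print(smoothness_gap(lambda s, a: 2.0 * s, 0.0, 0, epsilon=0.5))
```

In a sketch like this, the penalized estimate would replace the raw Q-value in the Bellman target, and the smoothness gap would be added as a regularization term. Because both terms are defined identically for offline and online data, the same objective can be kept when switching to online adaptation.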

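This work proposes the Robust Offline-to-Online (RO2O) algorithm, which enhances offline policies through uncertainty and smoothness and mitigates the performance drop during online adaptation. Experimental results show its superiority in facilitating stable offline-to-online learning.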