Offline reinforcement learning (RL) is a learning paradigm where an agent
learns from a fixed dataset of experience. However, learning solely from a
static dataset can limit performance because the agent cannot explore. To
overcome this limitation, offline-to-online RL combines offline pre-training
with online fine-tuning, which enables the agent to further refine its policy by
interacting with the environment in real time. Despite its benefits, existing
offline-to-online RL methods suffer from performance degradation and slow
improvement during the online phase. To tackle these challenges, we propose a
novel framework called Ensemble-based Offline-to-Online (E2O) RL. By increasing
the number of Q-networks, we seamlessly bridge offline pre-training and online
fine-tuning without degrading performance. Moreover, to speed up online
performance improvement, we appropriately loosen the pessimism of Q-value
estimation and incorporate ensemble-based exploration mechanisms into our
framework. Experimental results demonstrate that E2O can substantially improve
the training stability, learning efficiency, and final performance of existing
offline RL methods during online fine-tuning on a range of locomotion and
navigation tasks, significantly outperforming existing offline-to-online RL
methods.
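
As a rough illustration of the ideas in the abstract above, the following NumPy sketch contrasts a fully pessimistic ensemble target (minimum over all Q-networks) with a looser one (minimum over a random subset) and adds an uncertainty bonus for exploration. The function names, the random-subset rule, and the mean-plus-standard-deviation bonus are assumptions chosen for illustration; the abstract does not say which specific mechanisms E2O uses.

```python
import numpy as np

def pessimistic_target(q_values):
    """Minimum over the ensemble axis: the most conservative estimate.

    q_values has shape (num_members, batch_size).
    """
    return q_values.min(axis=0)

def loosened_target(q_values, rng, subset_size=2):
    """Minimum over a random subset of members (an assumed loosening rule).

    A subset minimum is never smaller than the minimum over all members.
    """
    num_members = q_values.shape[0]
    idx = rng.choice(num_members, size=subset_size, replace=False)
    return q_values[idx].min(axis=0)

def optimistic_action(q_per_candidate, bonus=1.0):
    """Index of the candidate action maximizing mean + bonus * std.

    q_per_candidate has shape (num_members, num_candidates).
    """
    mean = q_per_candidate.mean(axis=0)
    std = q_per_candidate.std(axis=0)
    return int(np.argmax(mean + bonus * std))

# Toy data: 3 ensemble members, 2 transitions.
rng = np.random.default_rng(0)
q = np.array([[1.0, 4.0], [2.0, 3.0], [3.0, 5.0]])
print(pessimistic_target(q))      # [1. 3.]
print(loosened_target(q, rng))    # elementwise >= the pessimistic target

# Toy data: 2 members scoring 2 candidate actions; members disagree on action 1.
q_cand = np.array([[1.0, 0.0], [1.0, 2.0]])
print(optimistic_action(q_cand, bonus=1.0))  # 1
```

With `bonus=0.0` the same call falls back to the ensemble mean, so the bonus is the only thing steering the agent toward actions on which the members disagree.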

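Proposes a new framework called Ensemble-based Offline-to-Online (E2O) RL. By increasing the number of Q-networks, it bridges offline pre-training and online fine-tuning without performance loss, and by loosening the pessimism of Q-value estimation and making appropriate use of ensemble-based exploration mechanisms, it speeds up online performance improvement. It significantly outperforms existing offline-to-online RL methods and substantially improves the training stability, learning efficiency, and final performance of existing offline RL methods during online fine-tuning on a range of locomotion and navigation tasks.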

Ensemble-based Offline-to-Online Reinforcement Learning: From Pessimistic Learning to Optimistic Exploration

Extracting action sequences from natural language texts is challenging, as it
requires commonsense inferences based on world knowledge. Although there has
been work on extracting action scripts, instructions, navigation actions, etc.,
these approaches require either that the set of candidate actions be provided in
advance or that action descriptions be restricted to a specific form, e.g.,
description templates. In this paper, we aim to extract action sequences from
texts in free natural language, i.e., without any restricted templates, even
when the candidate set of actions is unknown. We propose to extract action
sequences from texts based on the deep reinforcement learning framework.
Specifically, we view "selecting" or "eliminating" words from texts as
"actions", and the texts associated with actions as "states". We then build
Q-networks to learn the policy of extracting actions and extract plans from the
labeled texts. We demonstrate the effectiveness of our approach on several
datasets with comparison to state-of-the-art approaches, including online
experiments interacting with humans.
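
The state/action formulation in the abstract above can be sketched as a small environment in which each step either selects or eliminates the current word. Everything here is a hypothetical simplification: the class `WordTaggingEnv`, the +1/-1 reward against per-word labels, and the left-to-right scan are assumptions, and a plain callable stands in for the Q-network policy.

```python
from dataclasses import dataclass, field

SELECT, ELIMINATE = 1, 0

@dataclass
class WordTaggingEnv:
    words: list
    labels: list  # assumed annotation: 1 if the word is part of an action, else 0
    position: int = 0
    tags: list = field(default_factory=list)

    def state(self):
        """Words paired with the decisions made so far (None = undecided)."""
        undecided = [None] * (len(self.words) - len(self.tags))
        return list(zip(self.words, self.tags + undecided))

    def step(self, action):
        """Apply SELECT or ELIMINATE to the current word.

        Returns (next_state, reward, done); the reward is +1 when the
        decision matches the label and -1 otherwise.
        """
        reward = 1.0 if action == self.labels[self.position] else -1.0
        self.tags.append(action)
        self.position += 1
        done = self.position == len(self.words)
        return self.state(), reward, done

def extract(env, policy):
    """Run one episode and return the selected words and the total reward."""
    total = 0.0
    done = not env.words  # an empty text is finished immediately
    while not done:
        action = policy(env.state(), env.position)
        _, reward, done = env.step(action)
        total += reward
    selected = [word for word, tag in env.state() if tag == SELECT]
    return selected, total

words = ["Please", "open", "the", "door"]
labels = [0, 1, 0, 1]

# An oracle policy that simply reads the labels, standing in for a trained Q-network.
oracle = lambda state, position: labels[position]
selected, total = extract(WordTaggingEnv(words, labels), oracle)
print(selected, total)  # ['open', 'door'] 4.0
```

In a learned version, the policy would map the state (words plus decisions so far) to Q-values for the two actions and pick the larger one.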

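This paper uses Q-network models based on deep reinforcement learning to extract action sequences from natural language texts without template restrictions, and demonstrates the effectiveness of the approach through online experiments and comparison with existing techniques.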

Extracting Action Sequences from Texts Based on Deep Reinforcement Learning

We study reinforcement learning (RL) in high-dimensional episodic Markov
decision processes (MDPs). We consider value-based RL when the optimal Q-value
is a linear function of a d-dimensional state-action feature representation. For
instance, in deep Q-networks (DQN), the Q-value is a linear function of the
feature representation layer (output layer). We propose two algorithms, one
based on optimism, LINUCB, and another based on posterior sampling, LINPSRL. We
guarantee frequentist and Bayesian regret upper bounds of O(d \sqrt{T}) for
these two algorithms, where T is the number of episodes. We extend these
methods to deep RL and propose Bayesian deep Q-networks (BDQN), which use an
efficient Thompson sampling algorithm for high-dimensional RL. We deploy the
double DQN (DDQN) approach, and instead of learning the last layer of the
Q-network using linear regression, we use Bayesian linear regression, resulting
in an approximate posterior over the Q-function. This allows us to directly
incorporate the uncertainty over the Q-function and deploy Thompson sampling on
the learned posterior distribution, resulting in an efficient
exploration/exploitation trade-off. We empirically study the behavior of BDQN on
a wide range of Atari games. Since BDQN carries out more efficient exploration
and exploitation, it is able to reach higher returns substantially faster than
DDQN.
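
The last-layer idea in the abstract above can be sketched with standard conjugate Bayesian linear regression followed by Thompson sampling. This is a minimal sketch, not the authors' implementation: the function names, the isotropic Gaussian prior, the known noise variance, and the synthetic features are all assumptions made for illustration.

```python
import numpy as np

def blr_posterior(features, targets, noise_var=1.0, prior_var=1.0):
    """Gaussian posterior N(mean, cov) over the last-layer weights of one action.

    features: (n, d) feature vectors; targets: (n,) regression targets.
    Assumes a zero-mean isotropic Gaussian prior and known noise variance.
    """
    d = features.shape[1]
    precision = features.T @ features / noise_var + np.eye(d) / prior_var
    cov = np.linalg.inv(precision)
    cov = (cov + cov.T) / 2.0  # remove tiny numerical asymmetry
    mean = cov @ features.T @ targets / noise_var
    return mean, cov

def thompson_action(phi, posteriors, rng):
    """Sample one weight vector per action and act greedily on the samples."""
    sampled_q = [rng.multivariate_normal(mean, cov) @ phi
                 for mean, cov in posteriors]
    return int(np.argmax(sampled_q))

# Synthetic data: two actions whose true Q-values are linear in the features.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 3))
true_weights = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
posteriors = [
    blr_posterior(features,
                  features @ w + 0.1 * rng.normal(size=200),
                  noise_var=0.01)
    for w in true_weights
]

# For this feature vector, action 1 has a true Q-value of 5 and action 0 of 0.
phi = np.array([0.0, 5.0, 0.0])
print(thompson_action(phi, posteriors, rng))
```

Early in training the posterior covariance is wide, so sampled Q-values vary and the agent explores; as data accumulates the covariance shrinks and the choice approaches the greedy one.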

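This paper studies reinforcement learning in high-dimensional settings and proposes two algorithms, one based on optimism and one based on posterior sampling. It then extends these methods to deep reinforcement learning: the proposed Bayesian deep Q-network uses Bayesian linear regression to change how the Q-network is learned, allowing it to balance the trade-off between exploration and exploitation and to be applied more effectively to Atari games.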