Recent advancements in offline Reinforcement Learning (Offline RL) have led
to an increased focus on methods based on conservative policy updates to
address the Out-of-Distribution (OOD) issue. These methods typically involve
adding behavior regularization or modifying the critic learning objective,
focusing primarily on states or actions with substantial dataset support.
However, we challenge this prevailing notion by asserting that the absence of
an action or state from a dataset does not necessarily imply its suboptimality.
In this paper, we propose a novel approach to tackle the OOD problem. We
introduce an offline RL teacher-student framework, complemented by a policy
similarity measure. This framework enables the student policy to gain insights
not only from the offline RL dataset but also from the knowledge transferred by
a teacher policy. The teacher policy is trained using another dataset
consisting of state-action pairs, which can be viewed as practical domain
knowledge acquired without direct interaction with the environment. We believe
this additional knowledge is key to effectively solving the OOD issue. This
research represents a significant advancement in integrating a teacher-student
network into the actor-critic framework, opening new avenues for studies on
knowledge transfer in offline RL and effectively addressing the OOD challenge.

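This work proposes a new approach to the OOD problem in offline reinforcement learning. It introduces an offline RL teacher-student framework and a policy similarity measure, so that the student policy learns not only from the offline dataset but also from knowledge transferred by a teacher policy, with the aim of effectively addressing the OOD problem.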
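
The abstract does not say which similarity measure or loss the authors use, so the following is only a minimal sketch of the general idea of pulling a student policy toward a teacher while still following the critic. The squared-distance measure, the `weight` parameter, and the names `squared_distance` and `student_objective` are all assumptions made for illustration, not the paper's method.

```python
# Illustrative only: the similarity measure and loss form are assumptions,
# not taken from the paper.

def squared_distance(a, b):
    """Stand-in policy similarity measure: squared Euclidean distance between actions."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def student_objective(q_value, student_action, teacher_action, weight=1.0):
    """Loss to minimise: negative critic value plus a teacher-similarity penalty."""
    return -q_value + weight * squared_distance(student_action, teacher_action)


# Two hypothetical student actions with the same critic value.
# The one closer to the teacher's action receives the lower loss.
loss_near = student_objective(2.0, [0.5, 0.1], [0.5, 0.0])
loss_far = student_objective(2.0, [-1.0, 1.0], [0.5, 0.0])
print(loss_near < loss_far)
```

With equal critic values, the penalty alone decides between the two actions. In a real actor-critic setup the two terms would trade off against each other through `weight`.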

Augmenting Offline RL with Unlabeled Data

Offline reinforcement learning suffers from the out-of-distribution issue and
extrapolation error. Most policy constraint methods regularize the density of
the trained policy towards the behavior policy, which is too restrictive in
most cases. We propose Supported Trust Region optimization (STR), which
performs trust region policy optimization with the policy constrained within
the support of the behavior policy, a less restrictive constraint. We show
that, assuming no approximation or sampling error, STR guarantees strict
policy improvement until convergence to the optimal support-constrained policy
in the dataset. Furthermore, with both errors incorporated, STR still
guarantees safe policy improvement at each step. Empirical results validate
the theory of STR and demonstrate its state-of-the-art performance on MuJoCo
locomotion domains and the much more challenging AntMaze domains.

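In offline reinforcement learning, Supported Trust Region optimization (STR) constrains the policy within the support of the behavior policy. It guarantees strict policy improvement and, when approximation and sampling errors are included, still guarantees safe policy improvement at each step. Theoretical and empirical results validate its strong performance on MuJoCo locomotion domains and the more challenging AntMaze domains.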
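
The difference between a density constraint and a support constraint can be made concrete with a toy discrete example. This is a sketch under stated assumptions, not the STR algorithm: the three-action state, the probabilities in `behavior`, the values in `q_values`, and the helper names are all hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; infinite if p has mass where q has none."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue
        if qi == 0.0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total


def within_support(p, behavior):
    """True if p puts probability only on actions the behavior policy can take."""
    return all(pi == 0.0 or bi > 0.0 for pi, bi in zip(p, behavior))


def greedy_in_support(q, behavior):
    """Deterministic policy picking the highest-valued action inside the support."""
    best = max((a for a, b in enumerate(behavior) if b > 0.0), key=lambda a: q[a])
    return [1.0 if a == best else 0.0 for a in range(len(q))]


# Hypothetical single state with three actions; action 2 never occurs in the data.
behavior = [0.9, 0.1, 0.0]
q_values = [1.0, 5.0, 9.0]  # made-up values; the estimate for action 2 is OOD

policy = greedy_in_support(q_values, behavior)
print(policy)                           # all mass on action 1
print(within_support(policy, behavior)) # support constraint is satisfied
print(kl_divergence(policy, behavior))  # yet the density gap is log(10)
```

The policy that commits to the rarely taken action 1 stays inside the support, yet a KL-style density regularizer penalizes it for departing from the behavior policy's probabilities. Selecting the out-of-distribution action 2 would violate both constraints.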