Offline reinforcement learning (RL) methods constrain the policy to stay close
to the behavior policy, which stabilizes value learning and reduces the
selection of out-of-distribution (OOD) actions at test time. Conventional
approaches apply the same constraint to both value learning and test-time
inference. However, our findings indicate that a constraint suitable for value
estimation may be excessively restrictive for action selection at test time.
To address this issue, we propose a Mildly Constrained Evaluation Policy (MCEP)
for test-time inference, paired with a more constrained target policy for value
estimation. Since a target policy is used in various prior approaches, MCEP can
be integrated with them as a plug-in. We instantiate MCEP on the TD3-BC
[Fujimoto and Gu, 2021] and AWAC [Nair et al., 2020] algorithms. Empirical
results on MuJoCo locomotion tasks show that MCEP significantly outperforms
the target policy and achieves results competitive with state-of-the-art
offline RL methods. The code is open-sourced at this https URL
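
The two-policy idea above can be illustrated with a minimal sketch. It is a toy under stated assumptions, not the authors' implementation: `q_toy` is a hypothetical one-state critic, `q_scale` stands in for the detached mean |Q| normalizer of TD3-BC, and `ALPHA_EVAL = 10.0` is an illustrative value chosen here. Only the TD3-BC loss form (a Q term weighted by lambda = alpha / mean |Q|, plus a squared behavior-cloning term) and its default alpha of 2.5 come from TD3-BC itself.

```python
# Toy illustration (not the authors' implementation): one shared critic,
# two actor objectives that differ only in constraint strength.


def q_toy(action: float) -> float:
    """Hypothetical critic for a single state; its value peaks at action = 1.0."""
    return 2.0 - (action - 1.0) ** 2


def actor_loss(action: float, data_action: float, alpha: float,
               q_scale: float = 1.0) -> float:
    """TD3-BC-style actor loss: -lambda * Q(s, a) + (a - a_data)^2.

    lambda = alpha / q_scale, where q_scale stands in for the detached
    mean |Q| normalizer used by TD3-BC. A larger alpha gives a weaker
    behavior-cloning constraint.
    """
    lam = alpha / q_scale
    return -lam * q_toy(action) + (action - data_action) ** 2


def best_action(alpha: float, data_action: float = 0.0) -> float:
    """Grid-search the action that minimizes the actor loss."""
    grid = [i / 100 for i in range(-100, 201)]
    return min(grid, key=lambda a: actor_loss(a, data_action, alpha))


ALPHA_TARGET = 2.5   # TD3-BC default; this policy is the one used in value learning
ALPHA_EVAL = 10.0    # assumed milder setting; this policy is used only at test time

target_action = best_action(ALPHA_TARGET)
eval_action = best_action(ALPHA_EVAL)
print(f"target policy action:     {target_action:.2f}")
print(f"evaluation policy action: {eval_action:.2f}")
```

In this toy setting the dataset action is 0.0 and the critic prefers 1.0. The more constrained target policy settles nearer the dataset action, while the mildly constrained evaluation policy moves closer to the critic's maximizer, which is the behavior the abstract argues is useful at test time.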

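This work proposes a Mildly Constrained Evaluation Policy (MCEP) for test-time inference, instantiates it on the TD3-BC and AWAC algorithms, and achieves competitive results on MuJoCo locomotion tasks.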

Mildly Constrained Evaluation Policy for Offline Reinforcement Learning

Learning to perform tasks by leveraging a dataset of expert observations,
also known as imitation learning from observations (ILO), is an important
paradigm for learning skills without access to the expert reward function or
the expert actions. We consider ILO in the setting where the expert and the
learner agents operate in different environments, with the source of the
discrepancy being the transition dynamics model. Recent methods for scalable
ILO utilize adversarial learning to match the state-transition distributions of
the expert and the learner, an approach that becomes challenging when the
dynamics are dissimilar. In this work, we propose an algorithm that trains an
intermediary policy in the learner environment and uses it as a surrogate
expert for the learner. The intermediary policy is learned such that the state
transitions generated by it are close to the state transitions in the expert
dataset. To derive a practical and scalable algorithm, we employ concepts from
prior work on estimating the support of a probability distribution. Experiments
using MuJoCo locomotion tasks highlight that our method compares favorably to
the baselines for ILO with transition dynamics mismatch.
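
As a rough illustration of scoring how close a learner's state transitions are to the expert dataset, the sketch below uses a nearest-neighbor Gaussian score over concatenated (state, next state) vectors. This is a stand-in chosen for brevity and is not the support estimator used in the paper; the function name `support_score`, the `bandwidth` parameter, and the toy data are all hypothetical.

```python
import math
from typing import Sequence

Transition = Sequence[float]  # concatenated (state, next_state) features


def support_score(transition: Transition,
                  expert_transitions: Sequence[Transition],
                  bandwidth: float = 0.5) -> float:
    """Return a score in (0, 1] that is near 1 when the transition lies
    close to at least one expert transition, and near 0 far from all of them."""
    nearest_sq_dist = min(
        sum((x - y) ** 2 for x, y in zip(transition, expert))
        for expert in expert_transitions
    )
    return math.exp(-nearest_sq_dist / (2.0 * bandwidth ** 2))


# Toy expert dataset of one-dimensional (state, next_state) pairs.
expert_transitions = [(0.0, 0.1), (0.1, 0.2), (0.2, 0.3)]

on_support = support_score((0.1, 0.2), expert_transitions)
off_support = support_score((3.0, -2.0), expert_transitions)
print(f"on-support score:  {on_support:.3f}")
print(f"off-support score: {off_support:.3g}")
```

A score like this could serve as a reward signal for the intermediary policy in the learner environment, since it depends only on observed state transitions and not on expert actions.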

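This paper proposes a method that trains the learner with an intermediary policy, which approximately reproduces the expert's behavior, for imitation learning across different environments, and achieves good results on MuJoCo locomotion tasks.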