In offline reinforcement learning, a policy is learned from a static dataset, without costly feedback from the environment. Compared with the online setting, relying only on static datasets poses additional challenges, such as policies generating out-of-distribution samples. Model-based offline reinforcement learning methods try to overcome these challenges by learning a model of the underlying environment dynamics and using it to guide policy search. This is beneficial, but with limited datasets, errors in the model and value overestimation on out-of-distribution states can degrade performance. Current model-based methods apply some notion of conservatism to the Bellman update, often implemented through uncertainty estimates derived from model ensembles. In this paper, we propose Constrained Latent Action Policies (C-LAP), which learns a generative model of the joint distribution of observations and actions. We cast policy learning as a constrained objective that always stays within the support of the latent action distribution, and use the generative capabilities of the model to impose an implicit constraint on the generated actions. This eliminates the need for additional uncertainty penalties on the Bellman update and significantly decreases the number of gradient steps required to learn a policy. We empirically evaluate C-LAP on the D4RL and V-D4RL benchmarks and show that it is competitive with state-of-the-art methods, outperforming them in particular on datasets with visual observations.

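This paper addresses the problem in offline reinforcement learning that policies learned from static datasets generate out-of-distribution samples, and proposes a new method, Constrained Latent Action Policies (C-LAP). By learning a generative model of the joint distribution of observations and actions and casting policy learning as a constrained objective, the method eliminates the need for additional uncertainty penalties on the Bellman update and significantly reduces the number of gradient steps required to learn a policy. Experiments show that C-LAP is competitive with state-of-the-art methods and performs especially well on datasets with visual observations.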

Model-Based Offline Reinforcement Learning with Constrained Latent Action Policies