In safe offline reinforcement learning (RL), the objective is to learn a policy that maximizes cumulative reward while strictly adhering to safety constraints, using only offline data. Existing methods often struggle to balance these two goals, which leads to either diminished performance or increased safety risk. We address this with an approach that first learns a conservatively safe policy using Conditional Variational Autoencoders, which model the latent safety constraints. We then frame the problem as Constrained Reward-Return Maximization, in which the policy optimizes reward while complying with the inferred latent safety constraints. This is achieved by training an encoder with a reward-Advantage Weighted Regression objective within the latent constraint space. The method is supported by theoretical analysis, including bounds on policy performance and sample complexity. Extensive empirical evaluation on benchmark datasets, including challenging autonomous driving scenarios, shows that our approach maintains safety compliance while achieving higher cumulative reward than existing methods. Additional visualizations provide further insight into the effectiveness and underlying mechanisms of the approach.
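
To make the Advantage Weighted Regression step concrete, the following is a minimal sketch of the generic advantage-weighting idea only, not the paper's actual objective. In the method described above the weighting is applied when training an encoder in the latent constraint space; here the log-probabilities and advantages are plain arrays. The function names `awr_weights` and `awr_loss` and the parameters `temperature` and `max_weight` are illustrative assumptions.

```python
import numpy as np


def awr_weights(advantages, temperature=1.0, max_weight=20.0):
    """Exponentiated-advantage weights, clipped for numerical stability.

    `temperature` and `max_weight` are hypothetical hyperparameters chosen
    for illustration; the paper's values are not given in the abstract.
    """
    adv = np.asarray(advantages, dtype=np.float64)
    weights = np.exp(adv / temperature)
    return np.minimum(weights, max_weight)


def awr_loss(log_probs, advantages, temperature=1.0, max_weight=20.0):
    """Negative advantage-weighted log-likelihood, averaged over the batch."""
    weights = awr_weights(advantages, temperature, max_weight)
    return -np.mean(weights * np.asarray(log_probs, dtype=np.float64))
```

Samples with a higher reward advantage receive a larger weight, so minimizing this loss pulls the learned distribution toward higher-reward behavior that is still supported by the offline data. The clipping is a common stabilization choice and is likewise an assumption here.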

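This work addresses the problem of balancing policy optimization against safety constraints in safe offline reinforcement learning, where traditional methods often suffer from degraded performance or increased safety risk. We propose a new method that learns a conservatively safe policy with a Conditional Variational Autoencoder and casts the problem as constrained reward-return maximization, achieving both reward optimization and safety compliance. The method performs well in both theoretical analysis and empirical evaluation, and outperforms existing methods particularly in complex scenarios such as autonomous driving.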

Latent Safety-Constrained Policy Approach for Safe Offline Reinforcement Learning