Offline Reinforcement Learning (RL) aims to learn an effective agent policy
from offline datasets without online interaction, typically by imposing
conservative constraints within the support of the behavior policies to tackle
the Out-Of-Distribution (OOD) problem. However, existing works often suffer from
the constraint conflict issue when offline datasets are collected from multiple
behavior policies, i.e., different behavior policies may exhibit inconsistent
actions with distinct returns across the state space. To remedy this issue,
recent Advantage-Weighted (AW) methods prioritize samples with high advantage
values during agent training, but this inevitably leads to overfitting on these
samples. In this paper, we introduce a novel Advantage-Aware Policy
Optimization (A2PO) method to explicitly construct advantage-aware policy
constraints for offline learning under mixed-quality datasets. Specifically,
A2PO employs a Conditional Variational Auto-Encoder (CVAE) to disentangle the
action distributions of intertwined behavior policies by modeling the advantage
values of all training data as conditional variables. The agent can then follow
these disentangled action-distribution constraints to optimize the
advantage-aware policy toward high advantage values. Extensive experiments
conducted on both the single-quality and mixed-quality datasets of the D4RL
benchmark demonstrate that A2PO yields results superior to state-of-the-art
counterparts. Our code will be made publicly available.
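
The constraint conflict and the effect of conditioning on advantage can be illustrated with a small toy sketch. This is not the A2PO implementation: it replaces the CVAE with simple conditional means over a hand-made dataset, and the dataset, the baseline-based advantage estimate, and the helper `conditional_action` are all assumptions made for illustration only.

```python
from statistics import mean

# Toy mixed-quality dataset for a single state: (action, return) pairs.
# Two hypothetical behavior policies: an "expert" acting near +1 and a
# low-quality policy acting near -1.
dataset = [
    (+1.0, 10.0), (+0.9, 9.0), (+1.1, 11.0),   # expert-like samples
    (-1.0, 1.0), (-0.9, 2.0), (-1.1, 0.0),     # low-quality samples
]

# Crude advantage estimate: return minus the mean return at this state.
# This is a stand-in for Q(s, a) - V(s), assumed only for this sketch.
baseline = mean(ret for _, ret in dataset)
samples = [(action, ret - baseline) for action, ret in dataset]

# Unconditional behavior cloning averages the conflicting actions.
bc_action = mean(action for action, _ in samples)


def conditional_action(samples, want_high_advantage):
    """Mean action among samples whose advantage sign matches the condition."""
    chosen = [a for a, adv in samples if (adv > 0) == want_high_advantage]
    return mean(chosen)


# Conditioning on the advantage separates the two behavior modes.
high_adv_action = conditional_action(samples, True)
low_adv_action = conditional_action(samples, False)

print(f"behavior cloning:              {bc_action:+.2f}")
print(f"conditioned on high advantage: {high_adv_action:+.2f}")
print(f"conditioned on low advantage:  {low_adv_action:+.2f}")
```

In this toy setting the unconditioned average falls between the two behavior modes, an action that neither policy took, while conditioning on the advantage sign recovers each mode separately. A learned conditional generative model such as a CVAE plays the analogous role for continuous advantage values and high-dimensional states.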
