We propose Convex Constraint Learning for Reinforcement Learning (CoCoRL), a
novel approach for inferring shared constraints in a Constrained Markov
Decision Process (CMDP) from a set of safe demonstrations with possibly
different reward functions. While previous work is limited to demonstrations
with known rewards or fully known environment dynamics, CoCoRL can learn
constraints from demonstrations with different unknown rewards without
knowledge of the environment dynamics. CoCoRL constructs a convex safe set
based on demonstrations, which provably guarantees safety even for potentially
sub-optimal (but safe) demonstrations. For near-optimal demonstrations, CoCoRL
converges to the true safe set with no policy regret. We evaluate CoCoRL in
tabular environments and a continuous driving simulation with multiple
constraints. CoCoRL learns constraints that lead to safe driving behavior and
that can be transferred to different tasks and environments. In contrast,
alternative methods based on Inverse Reinforcement Learning (IRL) often exhibit
poor performance and learn unsafe policies.

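This work proposes Convex Constraint Learning for Reinforcement Learning (CoCoRL), a method that infers the shared constraints of a Constrained Markov Decision Process (CMDP) from safe demonstrations with possibly different reward functions. Unlike previous methods, it can learn constraints from demonstrations with different unknown rewards, and it constructs a convex safe set that guarantees safety even when the safe demonstrations are sub-optimal. The method is evaluated in tabular environments and in a continuous driving simulation with multiple constraints, where it learns safe driving behavior that transfers to different tasks and environments.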
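
The reason a convex safe set can guarantee safety may be easier to see in a small sketch. It assumes, beyond what the abstract states, that each policy is summarized by a vector of feature expectations and that the hidden constraints are linear in that vector. Under these assumptions, any convex combination of safe demonstrations is itself safe. All names and numbers below (`mix`, `is_safe`, `demos`, `hidden_constraints`) are hypothetical and are not taken from the paper's implementation.

```python
# Illustrative sketch only; not the authors' code.
# Assumption: constraints are linear in feature expectations,
# i.e. a policy with features f is safe iff c . f <= threshold for each constraint.

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def mix(points, weights):
    """Convex combination of feature-expectation vectors."""
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    dim = len(points[0])
    return [sum(w * p[i] for w, p in zip(weights, points)) for i in range(dim)]

def is_safe(features, constraints, tol=1e-9):
    """True if every linear constraint c . f <= threshold holds."""
    return all(dot(c, features) <= thr + tol for c, thr in constraints)

# Feature expectations of three safe demonstrations (made-up numbers).
demos = [[0.2, 0.5], [0.6, 0.1], [0.4, 0.4]]

# Hidden constraints as (cost vector, threshold) pairs. The learner never
# sees these; they are used here only to verify the claim.
hidden_constraints = [([1.0, 1.0], 0.9), ([1.0, 0.0], 0.7)]

# A point inside the convex hull of the demonstrations.
mixture = mix(demos, [0.5, 0.3, 0.2])

print(all(is_safe(d, hidden_constraints) for d in demos))
print(is_safe(mixture, hidden_constraints))
```

The learner in this sketch never needs the constraint vectors themselves: staying inside the convex hull of the demonstrations is enough to remain safe, regardless of which reward each demonstration optimized.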