Existing studies on constrained reinforcement learning (RL) may obtain a
well-performing policy in the training environment. However, when deployed in a
real environment, the policy may easily violate constraints that were satisfied
during training, because there may be a model mismatch between the training and
real environments. To address this challenge, we formulate the problem as
constrained RL under model uncertainty, where the goal is to learn a policy
that optimizes the reward while satisfying the constraint under model
mismatch. We develop a Robust Constrained Policy
Optimization (RCPO) algorithm, which is the first algorithm that applies to
large or continuous state spaces and has theoretical guarantees on worst-case
reward improvement and constraint violation at each iteration during
training. We demonstrate the effectiveness of our algorithm on a set of RL
tasks with constraints.
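
As a reading aid, one common way to write a constrained RL problem under model uncertainty is sketched below. The notation is an assumption made for illustration and does not come from the text above: $\mathcal{P}$ is a hypothetical uncertainty set of transition kernels, $V_r^{\pi,P}$ and $V_c^{\pi,P}$ are the expected discounted reward and cost of policy $\pi$ under kernel $P$, and $d$ is a cost threshold. The exact formulation used by RCPO may differ.

```latex
% Sketch under assumed notation: maximize the worst-case reward
% while keeping the worst-case cost below a threshold d.
\begin{aligned}
\max_{\pi} \quad & \min_{P \in \mathcal{P}} V_r^{\pi,P} \\
\text{s.t.} \quad & \max_{P \in \mathcal{P}} V_c^{\pi,P} \le d
\end{aligned}
```

Under this reading, a policy counts as feasible only if its constraint holds for every model in $\mathcal{P}$, which is what protects it against mismatch between the training and real environments.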
