Safe reinforcement learning tasks with multiple constraints are a challenging
domain despite being very common in the real world. To address this challenge,
we propose Objective Suppression, a novel method that adaptively suppresses the
task-reward-maximizing objectives according to a safety critic. We benchmark
Objective Suppression in two multi-constraint safety domains, including an
autonomous driving domain where any incorrect behavior can lead to disastrous
consequences. Empirically, we demonstrate that our proposed method, when
combined with existing safe RL algorithms, can match the task reward achieved
by our baselines while incurring significantly fewer constraint violations.
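
The abstract does not give the exact form of the suppression rule, so the following is only a toy illustration of the general idea, not the authors' formulation. It assumes one safety critic per constraint that predicts an expected cost, a positive budget per constraint, and a logistic weight driven by the worst cost-to-budget ratio. The names `suppression_weight`, `suppressed_objective`, and `sharpness` are hypothetical.

```python
import math


def suppression_weight(predicted_costs, budgets, sharpness=5.0):
    """Weight in (0, 1) applied to the task objective.

    Assumption: the weight depends on the worst cost-to-budget ratio
    across constraints, and budgets are strictly positive.
    """
    worst_ratio = max(c / b for c, b in zip(predicted_costs, budgets))
    # Clamp the exponent so math.exp cannot overflow for extreme ratios.
    z = min(max(sharpness * (worst_ratio - 1.0), -50.0), 50.0)
    # Near 1 when every critic predicts cost well under budget,
    # exactly 0.5 at the budget, near 0 when any constraint is exceeded.
    return 1.0 / (1.0 + math.exp(z))


def suppressed_objective(task_value, predicted_costs, budgets):
    """Blend the task objective with a cost penalty (hypothetical form)."""
    w = suppression_weight(predicted_costs, budgets)
    penalty = sum(c / b for c, b in zip(predicted_costs, budgets))
    return w * task_value - (1.0 - w) * penalty


if __name__ == "__main__":
    # Two constraints; the second critic predicts a cost above its budget.
    print(suppression_weight([0.1, 0.2], [1.0, 1.0]))
    print(suppression_weight([0.1, 1.5], [1.0, 1.0]))
```

In this sketch a single constraint close to violation is enough to suppress the task objective, which is one plausible way to handle several constraints at once.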

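We propose Objective Suppression, a novel method that adaptively suppresses the task-reward-maximizing objectives, to address the challenges of safe reinforcement learning tasks with multiple constraints. The method is benchmarked in two multi-constraint safety domains, including an autonomous driving domain where any incorrect behavior can lead to disastrous consequences. Empirically, we show that the method, when combined with existing safe RL algorithms, matches the task reward of our baselines with significantly fewer constraint violations.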

Multi-Constraint Safe RL with Objective Suppression for Safety-Critical Applications

When learning policies for real-world domains, two important questions arise:
(i) how to efficiently use pre-collected off-policy, non-optimal behavior data;
and (ii) how to mediate among different competing objectives and constraints.
We thus study the problem of batch policy learning under multiple constraints,
and offer a systematic solution. We first propose a flexible meta-algorithm
that admits any batch reinforcement learning and online learning procedure as
subroutines. We then present a specific algorithmic instantiation and provide
performance guarantees for the main objective and all constraints. To certify
constraint satisfaction, we propose a new and simple method for off-policy
policy evaluation (OPE) and derive PAC-style bounds. Our algorithm achieves
strong empirical results in different domains, including in a challenging
problem of simulated car driving subject to multiple constraints such as lane
keeping and smooth driving. We also show experimentally that our OPE method
outperforms other popular OPE techniques on a standalone basis, especially in a
high-dimensional setting.
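
One common way to read a meta-algorithm of this kind is as a Lagrangian game: an online learner adjusts the constraint multipliers, a best-response step picks a policy for the current multipliers, and the returned policy is a mixture of the best responses. The sketch below is a toy under stated assumptions, not the paper's algorithm. A small table of per-policy cost estimates stands in for the batch RL and OPE subroutines, and projected gradient ascent is chosen here as the online learner. The names `solve_constrained` and `mixture_value` are hypothetical.

```python
def solve_constrained(candidates, thresholds, rounds=2000, step=0.1):
    """candidates: list of (main_cost, [constraint_costs]) per policy.

    Assumption: these numbers stand in for what batch RL plus
    off-policy evaluation would estimate from logged data.
    """
    lam = [0.0] * len(thresholds)
    counts = [0] * len(candidates)

    def lagrangian(item):
        cost, gs = item
        return cost + sum(l * (g - t) for l, g, t in zip(lam, gs, thresholds))

    for _ in range(rounds):
        # Best response: the policy minimizing the current Lagrangian.
        best = min(range(len(candidates)), key=lambda i: lagrangian(candidates[i]))
        counts[best] += 1
        # Online learning step: projected gradient ascent on the multipliers.
        gs = candidates[best][1]
        lam = [max(0.0, l + step * (g - t)) for l, g, t in zip(lam, gs, thresholds)]

    # Mixture weights: how often each policy was the best response.
    return [c / rounds for c in counts], lam


def mixture_value(weights, candidates):
    """Main cost and constraint costs of the randomized mixture policy."""
    cost = sum(w * c for w, (c, _) in zip(weights, candidates))
    n = len(candidates[0][1])
    gs = [sum(w * g[k] for w, (_, g) in zip(weights, candidates)) for k in range(n)]
    return cost, gs


if __name__ == "__main__":
    # Policy 0 is cheap but violates the constraint; policy 1 is safe but costly.
    policies = [(1.0, [0.9]), (2.0, [0.1])]
    weights, lam = solve_constrained(policies, thresholds=[0.5])
    print(weights, mixture_value(weights, policies))
```

In this toy no single candidate is both cheap and feasible, so the mixture is what brings the constraint cost close to its threshold.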

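This work studies batch policy learning under multiple constraints in real-world domains. It proposes a systematic solution built on batch reinforcement learning and online learning subroutines, together with a new off-policy policy evaluation (OPE) method, and reports strong empirical results across several domains.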