Cost functions are commonly employed in Safe Deep Reinforcement Learning
(DRL). However, the cost is typically encoded as an indicator function due to
the difficulty of quantifying the risk of policy decisions in the state space.
Such an encoding requires the agent to visit numerous unsafe states to learn a
cost-value function that drives the learning process toward safety, which
increases the number of unsafe interactions and decreases sample efficiency.
In this paper, we investigate an alternative approach that uses domain
knowledge to quantify the risk in the proximity of such states by defining a
violation metric. This metric is computed by verifying task-level properties,
shaped as input-output conditions, and it is used as a penalty to bias the
policy away from unsafe states without learning an additional value function.
We investigate the benefits of using the violation metric in standard Safe DRL
benchmarks and robotic mapless navigation tasks. The navigation experiments
bridge the gap between Safe DRL and robotics, introducing a framework that
allows rapid testing on real robots. Our experiments show that policies trained
with the violation penalty achieve higher performance than Safe DRL baselines
and significantly reduce the number of visited unsafe states.
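
The idea of a violation metric used as a reward penalty can be illustrated with
a minimal sketch. It is not the authors' implementation: the abstract says the
metric is computed by verifying input-output properties, whereas this sketch
swaps in a plain random-sampling estimate around the current state. All names
(`violation`, `penalized_reward`, the `pre`/`post` property encoding, the
thresholds, and the toy policies) are hypothetical and chosen only for
illustration.

```python
import random

def violation(policy, state, prop, eps=0.05, n_samples=200, seed=0):
    """Estimate how often `policy` violates `prop` near `state`.

    Assumption: a property is a dict with a precondition on the input
    ("pre") and a required condition on the output ("post"). The value
    returned is the fraction of sampled points in an eps-box around
    `state` where the precondition holds but the output breaks "post".
    """
    rng = random.Random(seed)
    violated = 0
    for _ in range(n_samples):
        x = [s + rng.uniform(-eps, eps) for s in state]
        if prop["pre"](x) and not prop["post"](policy(x)):
            violated += 1
    return violated / n_samples

def penalized_reward(reward, violation_value, weight=1.0):
    """Subtract the weighted violation from the task reward."""
    return reward - weight * violation_value

# Hypothetical navigation property: if the obstacle in front is closer
# than 0.2, the commanded forward speed must not exceed 0.1.
slow_near_obstacle = {
    "pre": lambda x: x[0] < 0.2,
    "post": lambda speed: speed <= 0.1,
}

def reckless_policy(x):
    return 1.0  # always full speed, regardless of distance

def cautious_policy(x):
    return 0.0 if x[0] < 0.2 else 1.0  # stop when an obstacle is close

near_obstacle = [0.1]  # state: distance to the obstacle in front
v_reckless = violation(reckless_policy, near_obstacle, slow_near_obstacle)
v_cautious = violation(cautious_policy, near_obstacle, slow_near_obstacle)
```

Here the penalty is non-zero for the reckless policy even though no collision
has happened yet, which is the point of quantifying risk in the proximity of
unsafe states rather than signalling it only with an indicator cost.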

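This paper presents a method that uses a "violation metric" to penalize states where safety cannot be guaranteed, with the aim of improving safe deep reinforcement learning. Experiments on robotic mapless navigation tasks show that, compared with Safe DRL baselines, policies trained with the violation metric achieve higher performance and substantially reduce the number of unsafe states visited.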

Safe Deep Reinforcement Learning by Verifying Task-Level Properties

The behavior of self-driving cars must be compatible with an enormous set of
conflicting and ambiguous objectives, from law, from ethics, from the local
culture, and so on. This paper describes a new way to conveniently define the
desired behavior for autonomous agents, which we use on the self-driving cars
developed at nuTonomy. We define a "rulebook" as a pre-ordered set of "rules",
each akin to a violation metric on the possible outcomes ("realizations"). The
rules are partially ordered by priority. The semantics of a rulebook imposes a
pre-order on the set of realizations. We study the compositional properties of
the rulebooks, and we derive which operations we can allow on the rulebooks to
preserve previously-introduced constraints. While we demonstrate the
application of these techniques in the self-driving domain, the methods are
domain-independent.
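
A small sketch can make the rulebook idea concrete. The abstract does not spell
out the comparison rule, so the semantics below is an assumption: one
realization is taken to be at least as good as another if, for every rule on
which it is worse, some strictly higher-priority rule exists on which it is
better. Priorities are encoded as integer ranks (equal ranks mean equal
priority), which is a simplification of a general pre-order. The names `Rule`,
`at_least_as_good`, and the two example rules are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Realization = Dict[str, float]

@dataclass(frozen=True)
class Rule:
    name: str
    rank: int  # smaller rank = higher priority; equal ranks = same priority
    violation: Callable[[Realization], float]  # 0.0 means fully satisfied

def at_least_as_good(x: Realization, y: Realization, rules: List[Rule]) -> bool:
    """Return True if x is at least as good as y under the assumed semantics."""
    for r in rules:
        if r.violation(x) > r.violation(y):
            # x is worse on rule r: this must be justified by a strictly
            # higher-priority rule on which x is strictly better than y.
            justified = any(
                h.rank < r.rank and h.violation(x) < h.violation(y)
                for h in rules
            )
            if not justified:
                return False
    return True

# Hypothetical two-rule rulebook: avoiding collisions outranks lane keeping.
rulebook = [
    Rule("no_collision", 0, lambda z: z["collision_energy"]),
    Rule("stay_in_lane", 1, lambda z: z["lane_deviation"]),
]

swerve = {"collision_energy": 0.0, "lane_deviation": 0.5}
stay = {"collision_energy": 1.0, "lane_deviation": 0.0}

swerve_preferred = at_least_as_good(swerve, stay, rulebook)
stay_preferred = at_least_as_good(stay, swerve, rulebook)
```

In this example leaving the lane is acceptable because it avoids a violation of
the higher-priority collision rule, while the reverse comparison fails.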

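This paper introduces a new way to define the desired behavior of autonomous agents: the "rulebook", a set of "rules", each akin to a violation metric on the possible outcomes ("realizations"), with the rules partially ordered by priority. The semantics of a rulebook imposes a pre-order on the set of realizations. The paper studies the compositional properties of rulebooks and derives which operations on them preserve previously introduced constraints. Although the techniques are demonstrated in the self-driving domain, the methods are domain-independent.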