Safe reinforcement learning (RL) has achieved significant success on
risk-sensitive tasks and shown promise in autonomous driving (AD) as well.
However, efficient and reproducible
baselines tailored to safe AD are still lacking. In this paper, we release SafeRL-Kit
to benchmark safe RL methods for AD-oriented tasks. Concretely, SafeRL-Kit
contains several recent algorithms designed for zero-constraint-violation tasks,
including Safety Layer, Recovery RL, the off-policy Lagrangian method, and Feasible
Actor-Critic. In addition to existing approaches, we propose a novel
first-order method named Exact Penalty Optimization (EPO) and
demonstrate its capability in safe AD. All algorithms in SafeRL-Kit are
implemented (i) under the off-policy setting, which improves sample efficiency
and can better leverage past logs; (ii) with a unified learning framework,
providing off-the-shelf interfaces for researchers to incorporate their
domain-specific knowledge into fundamental safe RL methods. Finally, we
conduct a comparative evaluation of the above algorithms in SafeRL-Kit and shed
light on their efficacy for safe autonomous driving. The source code is
available at this https URL.
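
The abstract does not spell out how EPO is formulated, so the following is only a sketch of the generic exact-penalty idea from constrained optimization, not the paper's algorithm. For a problem "minimize f(x) subject to g(x) <= 0", the penalized objective f(x) + kappa * max(0, g(x)) has the same minimizer as the constrained problem once kappa exceeds the optimal Lagrange multiplier. The toy functions, the name `kappa`, and the grid search (used here instead of a gradient step so the result is easy to check) are all assumptions made for illustration.

```python
def penalized_objective(x, kappa):
    """Toy exact-penalty objective: f(x) + kappa * max(0, g(x)).

    Assumed toy problem (not from the paper):
      f(x) = (x - 2)^2   task loss, unconstrained minimum at x = 2
      g(x) = x - 1 <= 0  safety constraint, so the safe optimum is x = 1
    The optimal Lagrange multiplier of this toy problem is 2.
    """
    task_loss = (x - 2.0) ** 2
    violation = max(0.0, x - 1.0)
    return task_loss + kappa * violation


def minimize_on_grid(kappa, lo=0.0, hi=3.0, steps=3001):
    """Brute-force minimizer over an evenly spaced grid (illustrative only)."""
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return min(grid, key=lambda x: penalized_objective(x, kappa))


if __name__ == "__main__":
    # kappa above the multiplier (2): the minimizer lands on the constraint boundary.
    print(minimize_on_grid(kappa=3.0))
    # kappa below the multiplier: the minimizer still violates the constraint.
    print(minimize_on_grid(kappa=1.0))
```

With a large enough penalty factor the constraint is met exactly without updating a separate multiplier, which is the usual appeal of exact penalties over Lagrangian updates. How EPO applies this in an off-policy RL setting is left to the paper.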

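This paper presents the SafeRL-Kit toolkit, which includes recent algorithms for zero-constraint-violation tasks along with a new first-order method, Exact Penalty Optimization (EPO), and compares in detail how effective the algorithms in SafeRL-Kit are for safe autonomous driving.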

SafeRL-Kit: Evaluating Efficient Reinforcement Learning Methods for Safe Autonomous Driving

Safety remains a central obstacle preventing widespread use of RL in the real
world: learning new tasks in uncertain environments requires extensive
exploration, but safety requires limiting exploration. We propose Recovery RL,
an algorithm which navigates this tradeoff by (1) leveraging offline data to
learn about constraint-violating zones before policy learning and (2)
separating the goals of improving task performance and constraint satisfaction
across two policies: a task policy that only optimizes the task reward and a
recovery policy that guides the agent to safety when constraint violation is
likely. We evaluate Recovery RL on 6 simulation domains, including two
contact-rich manipulation tasks and an image-based navigation task, and on an
image-based obstacle avoidance task on a physical robot. We compare Recovery RL
to 5 prior safe RL methods which jointly optimize for task performance and
safety via constrained optimization or reward shaping and find that Recovery RL
outperforms the next best prior method across all domains. Results suggest that
Recovery RL trades off constraint violations and task successes 2 to 20 times
more efficiently in simulation domains and 3 times more efficiently in physical
experiments. See this https URL for videos and supplementary
material.
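
The two-policy split described above can be sketched as a simple action-selection rule: the task policy proposes an action, and the recovery policy takes over when a learned risk estimate for that proposal is too high. This is a minimal sketch under stated assumptions, not the authors' implementation. The names `select_action`, `risk_critic`, and `risk_threshold`, the threshold value, and the one-dimensional toy policies are all hypothetical.

```python
def select_action(state, task_policy, recovery_policy, risk_critic,
                  risk_threshold=0.5):
    """Return (action, used_recovery) for one environment step.

    task_policy(state)         -> action that only optimizes task reward
    recovery_policy(state)     -> action that steers back toward safety
    risk_critic(state, action) -> estimated chance of a constraint violation
    All three callables and the threshold are assumed interfaces.
    """
    proposed = task_policy(state)
    if risk_critic(state, proposed) > risk_threshold:
        # Constraint violation is judged likely: defer to the recovery policy.
        return recovery_policy(state), True
    return proposed, False


# Toy 1-D world (hypothetical): the state is a position x, unsafe when x >= 1.
def toy_task_policy(x):
    return 0.5  # always push toward the goal on the right


def toy_recovery_policy(x):
    return -0.5  # always back away from the unsafe zone


def toy_risk_critic(x, action):
    # Stand-in for a risk critic learned from offline data.
    return 1.0 if x + action >= 1.0 else 0.0


if __name__ == "__main__":
    for x in (0.0, 0.8):
        print(x, select_action(x, toy_task_policy, toy_recovery_policy,
                               toy_risk_critic))
```

Keeping the switch outside both policies means the task policy never has to trade reward against safety itself, which matches the separation of goals the abstract describes.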

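This paper proposes Recovery RL, an algorithm that balances task reward against safety by using offline data to learn constraint-violating zones and by assigning the goals of task performance and constraint satisfaction to two separate policies. It is evaluated on six simulation domains and on a physical robot, showing that Recovery RL is more efficient and performs better than prior safe RL methods in these domains.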