Safe reinforcement learning (SafeRL) extends standard reinforcement learning with a notion of safety, typically defined by constraining the expected cost return of a trajectory to stay below a set limit. However, this metric does not distinguish how costs accrue: it treats infrequent severe cost events as equal to frequent mild ones, which can lead to riskier behaviors and unsafe exploration. We introduce a new metric, expected maximum consecutive cost steps (EMCC), which addresses safety during training by assessing the severity of unsafe steps based on their consecutive occurrence. This metric is particularly effective for distinguishing between prolonged and occasional safety violations. We apply EMCC to both on- and off-policy algorithms to benchmark their safe exploration capability. Finally, we validate our metric on a set of benchmarks and propose a new lightweight benchmark task that allows fast evaluation during algorithm design.
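To make the contrast between cost return and EMCC concrete, the following minimal sketch estimates both from per-step cost sequences. It rests on assumptions not stated in the abstract: a step counts as unsafe when its cost exceeds a threshold (zero here), the per-trajectory quantity is the length of the longest run of consecutive unsafe steps, and the expectation is approximated by a plain average over sampled trajectories. The function names and the two example trajectories are hypothetical, and the paper's exact definition (for example, any normalization by episode length) may differ.

```python
from typing import Sequence


def max_consecutive_cost_steps(costs: Sequence[float], threshold: float = 0.0) -> int:
    """Length of the longest run of consecutive steps whose cost exceeds `threshold`.

    Assumption: a step is "unsafe" when cost > threshold.
    """
    longest = 0
    current = 0
    for cost in costs:
        if cost > threshold:
            current += 1
            longest = max(longest, current)
        else:
            current = 0  # a safe step breaks the run
    return longest


def estimate_emcc(trajectories: Sequence[Sequence[float]], threshold: float = 0.0) -> float:
    """Monte Carlo estimate: average longest unsafe run over sampled trajectories."""
    if not trajectories:
        raise ValueError("need at least one trajectory")
    runs = [max_consecutive_cost_steps(t, threshold) for t in trajectories]
    return sum(runs) / len(runs)


def cost_return(costs: Sequence[float]) -> float:
    """Undiscounted cumulative cost of one trajectory, for comparison."""
    return sum(costs)


# Two hypothetical trajectories with the same cost return but different structure.
prolonged = [0, 1, 1, 1, 1, 0, 0, 0]   # one sustained violation
occasional = [1, 0, 1, 0, 1, 0, 1, 0]  # isolated violations

print(cost_return(prolonged), cost_return(occasional))
print(max_consecutive_cost_steps(prolonged), max_consecutive_cost_steps(occasional))
print(estimate_emcc([prolonged, occasional]))
```

Under these assumptions the two trajectories have identical cost returns, so a cost-return constraint cannot tell them apart, while the longest-run statistic separates the sustained violation from the scattered ones.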

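This study addresses a key problem in safe reinforcement learning: existing safety metrics do not effectively distinguish how costs accumulate. We propose a new metric, expected maximum consecutive cost steps (EMCC), which assesses the severity of unsafe steps more accurately and thereby improves safety during training. The results show that the metric performs well at distinguishing prolonged from occasional safety violations, and its effectiveness is validated on a set of benchmarks.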

Revisiting Safe Exploration in Safe Reinforcement Learning