Reinforcement learning (RL) has achieved promising results on most robotic control tasks. Safety of learning-based controllers is an essential notion of ensuring the effectiveness of the controllers. Current methods adopt whole consistency constraints during the training, thus resulting in inefficient exploration in the early stage. In this paper, we propose a Constrained Policy Optimization with Extra Safety Budget (ESB-CPO) algorithm to strike a balance between the exploration and the constraints. In the early stage, our method loosens the practical constraints of unsafe transitions (adding extra safety budget) with the aid of a new metric we propose. With the training process, the constraints in our optimization problem become tighter. Meanwhile, theoretical analysis and practical experiments demonstrate that our method gradually meets the cost limit's demand in the final training stage. When evaluated on Safety-Gym and Bullet-Safety-Gym benchmarks, our method has shown its advantages over baseline algorithms in terms of safety and optimality. Remarkably, our method gains remarkable performance improvement under the same cost limit compared with CPO algorithm.

本文提出了一种ESB-CPO算法，通过在早期阶段增加额外的安全预算来平衡探索和约束，以提高过程的效率，证明其在保证安全性的基础上能够显著提高性能。

利用额外安全预算在受限策略优化中进行高效探索