Incorporating safety is an essential prerequisite for broadening the
practical applications of reinforcement learning in real-world scenarios.
Constrained Markov Decision Processes (CMDPs) address this challenge by
introducing a separate cost function that represents safety violations. In the
CMDP setting, previous algorithms have used Lagrangian relaxation to convert
the constrained optimization problem into an unconstrained dual problem.
However, these algorithms may predict unsafe behavior inaccurately, which makes
learning the Lagrange multiplier unstable. This study introduces a novel safe
reinforcement learning algorithm, Safety Critic Policy Optimization (SCPO). We
define the safety critic, a mechanism that nullifies rewards obtained through
violating safety constraints. Furthermore, our theoretical analysis indicates
that the proposed algorithm can automatically balance the trade-off between
adhering to safety constraints and maximizing rewards. The effectiveness of the
SCPO algorithm is empirically validated by benchmarking it against strong
baselines.
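
For readers unfamiliar with the Lagrangian relaxation that the abstract refers to, the following is a minimal sketch of the generic dual-ascent update used by Lagrangian-based safe RL methods. It is not SCPO itself: the function names, the learning rate, and the cost values are hypothetical and chosen only for illustration.

```python
def dual_ascent_step(lam, avg_cost, cost_limit, lr=0.05):
    """One projected gradient-ascent step on the Lagrange multiplier.

    All argument names and the default learning rate are illustrative
    assumptions, not values taken from the paper.
    """
    # The gradient of the Lagrangian w.r.t. the multiplier is the
    # constraint violation: expected cost minus the allowed budget.
    violation = avg_cost - cost_limit
    # Project onto lam >= 0 so the penalty never rewards extra cost.
    return max(0.0, lam + lr * violation)


def penalized_reward(reward, cost, lam):
    """Per-step objective the policy maximizes for a fixed multiplier."""
    return reward - lam * cost


lam = 0.0
# Hypothetical average episode costs over four epochs, budget of 1.0.
for avg_cost in [3.0, 2.0, 1.5, 0.5]:
    lam = dual_ascent_step(lam, avg_cost, cost_limit=1.0)
    print(f"avg_cost={avg_cost:.1f} -> lambda={lam:.3f}")
```

The multiplier grows while the measured cost exceeds the budget and shrinks once it falls below. Because the update depends directly on the cost estimate, a noisy or biased estimate makes the multiplier oscillate, which is the instability the abstract describes.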

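This study introduces a new safe reinforcement learning algorithm, Safety Critic Policy Optimization (SCPO). By introducing a safety critic mechanism, the algorithm automatically balances the trade-off between adhering to safety constraints and maximizing rewards, and empirical validation demonstrates its effectiveness.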

SCPO: Safe Reinforcement Learning with Safety Critic Policy Optimization

Safe exploration can be regarded as a constrained Markov decision process in
which the expected long-term cost is bounded. Previous off-policy algorithms
convert the constrained optimization problem into the corresponding
unconstrained dual problem through Lagrangian relaxation. However, the cost
estimates in these algorithms are inaccurate, which destabilizes the learning
of the Lagrange multiplier. In this paper, we present a novel off-policy
reinforcement learning algorithm called Conservative Distributional Maximum a
Posteriori Policy Optimization (CDMPO). First, to judge accurately whether the
current situation satisfies the constraints, CDMPO adopts a distributional
reinforcement learning method to estimate the Q-function and C-function. Then,
CDMPO uses a conservative value function loss to reduce the number of
constraint violations during exploration. In addition, we use Weighted Average
Proportional Integral Derivative (WAPID) to update the Lagrange multiplier
stably. Empirical results show that the proposed method incurs fewer constraint
violations in the early exploration process. The final test results also show
that our method achieves better risk control.
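
The abstract does not give the WAPID update rule, so the sketch below should be read only as an assumed illustration of the general idea: a PID-style multiplier update driven by an exponentially weighted average of the episode cost. The class name, the gains, and the smoothing factor are all hypothetical and may differ from what CDMPO actually uses.

```python
class SmoothedPIDMultiplier:
    """PID-style Lagrange multiplier update on a smoothed cost signal.

    Hypothetical illustration; not the WAPID rule from the paper.
    """

    def __init__(self, kp=0.1, ki=0.01, kd=0.05, ema=0.9):
        self.kp, self.ki, self.kd, self.ema = kp, ki, kd, ema
        self.integral = 0.0
        self.smoothed_cost = None
        self.prev_smoothed_cost = None

    def update(self, episode_cost, cost_limit):
        # Weighted (exponential moving) average of the observed cost,
        # which damps the noise of single-episode estimates.
        if self.smoothed_cost is None:
            self.smoothed_cost = episode_cost
        else:
            self.smoothed_cost = (self.ema * self.smoothed_cost
                                  + (1.0 - self.ema) * episode_cost)

        # Proportional term: current constraint violation.
        error = self.smoothed_cost - cost_limit
        # Integral term: accumulated violation, clamped at zero.
        self.integral = max(0.0, self.integral + error)
        # Derivative term: reacts only to rising cost.
        if self.prev_smoothed_cost is None:
            derivative = 0.0
        else:
            derivative = max(0.0, self.smoothed_cost - self.prev_smoothed_cost)
        self.prev_smoothed_cost = self.smoothed_cost

        # The multiplier must stay non-negative.
        return max(0.0, self.kp * error
                   + self.ki * self.integral
                   + self.kd * derivative)


controller = SmoothedPIDMultiplier()
# Hypothetical episode costs with a budget of 1.0.
for cost in [5.0, 4.0, 2.0, 0.5, 0.0]:
    print(controller.update(cost, cost_limit=1.0))
```

Compared with plain dual ascent, which corresponds to the integral term alone, the proportional and derivative terms respond immediately to a violation and to a rising cost trend, and the averaging keeps one unusually costly episode from swinging the multiplier.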

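This paper proposes an off-policy reinforcement learning algorithm called Conservative Distributional Maximum a Posteriori Policy Optimization (CDMPO) for constrained decision problems in safe exploration. It uses a distributional reinforcement learning method to estimate the Q-function and C-function accurately, applies a conservative value function loss to reduce the number of constraint violations, and uses Weighted Average Proportional Integral Derivative (WAPID) to update the Lagrange multiplier stably. In experiments, it shows better risk control.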