In this paper, we consider the problem of learning safe policies for
probabilistic-constrained reinforcement learning (RL). Specifically, a safe
policy or controller is one that, with high probability, maintains the
trajectory of the agent in a given safe set. We establish a connection between
this probabilistic-constrained setting and the cumulative-constrained
formulation that is frequently explored in the existing literature. We provide
theoretical bounds showing that the probabilistic-constrained setting offers a
better trade-off between optimality and safety (constraint satisfaction). The
main challenge in dealing with probabilistic constraints is the absence of
explicit expressions for their gradients. Our prior work provides such an
explicit gradient expression for probabilistic constraints, which we term Safe
Policy Gradient-REINFORCE (SPG-REINFORCE). In this work, we provide an improved
gradient, SPG-Actor-Critic, that has lower variance than SPG-REINFORCE, as
substantiated by our theoretical results. Both SPGs are algorithm-independent,
so they can be applied across a range of policy-based algorithms. Furthermore,
we propose a Safe Primal-Dual algorithm that can leverage both SPGs to learn
safe policies. We then provide theoretical analyses covering the convergence of
the algorithm, as well as its near-optimality and feasibility on average. In
addition, we test the proposed approaches through a series of empirical
experiments. These experiments examine the trade-offs between optimality and
safety and substantiate the efficacy of the two SPGs, as well as our
theoretical contributions.
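
To make the probabilistic constraint concrete, here is a minimal sketch,
which is not the paper's method. It estimates by Monte Carlo the probability
that an entire trajectory stays inside a safe set, then applies a generic
projected dual step for a constraint of the form P(safe) >= 1 - delta. The
scalar linear system, the feedback `gain`, the interval safe set, and all
function and parameter names are assumptions made for illustration; none of
them come from the paper.

```python
import random


def rollout_is_safe(gain, horizon, safe_radius, noise_std, rng):
    """Simulate x_{t+1} = x_t - gain * x_t + noise (a hypothetical system).

    Returns True only if |x_t| <= safe_radius at every step, i.e. the whole
    trajectory stays in the safe set, not just its average or its endpoint.
    """
    x = 0.0
    for _ in range(horizon):
        x = x - gain * x + rng.gauss(0.0, noise_std)
        if abs(x) > safe_radius:
            return False
    return True


def estimate_safety_probability(gain, n_rollouts=2000, horizon=20,
                                safe_radius=1.0, noise_std=0.3, seed=0):
    """Monte Carlo estimate of P(trajectory stays in the safe set)."""
    rng = random.Random(seed)
    n_safe = sum(
        rollout_is_safe(gain, horizon, safe_radius, noise_std, rng)
        for _ in range(n_rollouts)
    )
    return n_safe / n_rollouts


def dual_update(lam, p_safe, delta, step_size):
    """Projected dual step for the constraint p_safe >= 1 - delta.

    The multiplier grows while the constraint is violated and shrinks
    toward zero (never below it) once the constraint holds with slack.
    """
    return max(0.0, lam - step_size * (p_safe - (1.0 - delta)))
```

In this toy setting, a larger `gain` pulls the state back toward the origin
more strongly, so its estimated safety probability is higher than that of the
uncontrolled case `gain=0.0`.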

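This paper studies the problem of learning safe policies in
probabilistic-constrained reinforcement learning and addresses it with two safe
policy gradients, Safe Policy Gradient-REINFORCE and SPG-Actor-Critic, together
with a Safe Primal-Dual algorithm. Experiments validate the effectiveness and
superiority of these methods.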

Probabilistic Constraint for Safety-Critical Reinforcement Learning

An important problem in sequential decision-making under uncertainty is to
use limited data to compute a safe policy, i.e., a policy that is guaranteed to
perform at least as well as a given baseline strategy. In this paper, we
develop and analyze a new model-based approach to compute a safe policy when we
have access to an inaccurate dynamics model of the system with known accuracy
guarantees. Our proposed robust method uses this (inaccurate) model to directly
minimize the (negative) regret w.r.t. the baseline policy. In contrast to
existing approaches, minimizing the regret allows one to improve the baseline
policy in states with accurate dynamics and to fall back seamlessly to the
baseline policy otherwise. We show that our formulation is NP-hard and propose
an approximate algorithm. Our empirical results on several domains show that
even this relatively simple approximate algorithm can significantly outperform
standard approaches.
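
The idea of improving on the baseline only where the model can be trusted can
be illustrated with a toy sketch, which is not the paper's approximate
algorithm. It assumes a finite set of plausible models in place of the
accuracy guarantees, and picks by brute-force enumeration the deterministic
policy whose worst-case improvement over the baseline is largest. The 3-state
MDP, the parameter `q`, and all function names are hypothetical.

```python
import itertools

GAMMA = 0.9  # discount factor (assumed)


def make_model(q):
    """Hypothetical 3-state, 2-action MDP.

    q is the probability that action 1 in state 1 returns to state 0;
    otherwise it lands in the absorbing zero-reward state 2. All other
    transitions are known exactly and identical across models.
    """
    # P[s][a] is the next-state distribution, R[s][a] the reward.
    P = [
        [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]],
        [[1.0, 0.0, 0.0], [q, 0.0, 1.0 - q]],
        [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
    ]
    R = [[1.0, 2.0], [1.0, 2.0], [0.0, 0.0]]
    return P, R


def policy_return(policy, model, start=0, iters=500):
    """Discounted return from `start` via iterative policy evaluation."""
    P, R = model
    n = len(P)
    v = [0.0] * n
    for _ in range(iters):
        v = [
            R[s][policy[s]]
            + GAMMA * sum(p * v[t] for t, p in enumerate(P[s][policy[s]]))
            for s in range(n)
        ]
    return v[start]


def worst_case_improvement(policy, baseline, models):
    """Smallest gain over the baseline across all plausible models."""
    return min(
        policy_return(policy, m) - policy_return(baseline, m) for m in models
    )


def robust_policy(baseline, models, n_states=3, n_actions=2):
    """Enumerate deterministic policies; only feasible for toy problems."""
    candidates = itertools.product(range(n_actions), repeat=n_states)
    return max(
        candidates,
        key=lambda pi: worst_case_improvement(pi, baseline, models),
    )


# Two plausible models that disagree only about action 1 in state 1.
models = [make_model(1.0), make_model(0.5)]
baseline = (0, 0, 0)
best = robust_policy(baseline, models)
```

With these assumed numbers, the selected policy switches to the
higher-reward action in state 0, where both models agree, and keeps the
baseline action in state 1, where the models disagree. Because the baseline
is itself a candidate, the worst-case improvement of the selected policy is
never negative.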

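This paper proposes a model-based method for computing a safe policy from
limited data, given an inaccurate dynamics model of the system with known
accuracy guarantees. The method directly minimizes the (negative) regret with
respect to the baseline policy, which improves the baseline policy in states
with accurate dynamics and falls back seamlessly to the baseline policy where
the dynamics are inaccurate.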