In this paper, we consider the problem of learning safe policies for
probabilistic-constrained reinforcement learning (RL). Specifically, a safe
policy or controller is one that, with high probability, maintains the
trajectory of the agent in a given safe set. We establish a connection between
this probabilistic-constrained setting and the cumulative-constrained
formulation that is frequently explored in the existing literature. We provide
theoretical bounds showing that the probabilistic-constrained setting offers a
better trade-off between optimality and safety (constraint satisfaction). The
main challenge in handling probabilistic constraints is the absence of
explicit expressions for their gradients. Our prior work provides such an explicit
gradient expression for probabilistic constraints, which we term Safe Policy
Gradient-REINFORCE (SPG-REINFORCE). In this work, we provide an improved
gradient, SPG-Actor-Critic, that has lower variance than SPG-REINFORCE, as
substantiated by our theoretical results. Both SPGs are algorithm-independent,
so they can be applied across a range of policy-based algorithms. Furthermore, we propose
a Safe Primal-Dual algorithm that can leverage both SPGs to learn safe
policies. We then analyze the algorithm theoretically, establishing its
convergence as well as its near-optimality and feasibility on average.
Finally, we evaluate the proposed approaches in a series of empirical
experiments that examine the trade-off between optimality and safety and
substantiate both the efficacy of the two SPGs and our theoretical
contributions.
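
To make the distinction between the two constraint types concrete, the following is a minimal sketch, not the paper's method. It assumes a common reading of the two settings: the probabilistic constraint bounds the probability that the whole trajectory stays in the safe set, while the cumulative constraint bounds the expected number of unsafe steps. The toy dynamics, the safe set |x| <= bound, and the names `simulate` and `safety_metrics` are hypothetical and chosen only for illustration.

```python
import numpy as np


def simulate(gain, horizon=20, n_traj=5000, noise=0.3, bound=1.0, seed=0):
    """Roll out the toy system x_{t+1} = (1 - gain) * x_t + noise * w_t.

    Returns a boolean array of shape (n_traj, horizon); entry [i, t] is True
    when trajectory i lies in the hypothetical safe set |x| <= bound at step t.
    """
    rng = np.random.default_rng(seed)
    x = np.zeros(n_traj)
    safe = np.empty((n_traj, horizon), dtype=bool)
    for t in range(horizon):
        x = (1.0 - gain) * x + noise * rng.standard_normal(n_traj)
        safe[:, t] = np.abs(x) <= bound
    return safe


def safety_metrics(safe):
    """Monte Carlo estimates of the two safety notions (assumed forms)."""
    # Probabilistic notion: fraction of trajectories that are safe at every step.
    p_traj = float(safe.all(axis=1).mean())
    # Cumulative notion: average number of unsafe steps per trajectory.
    expected_unsafe = float((~safe).sum(axis=1).mean())
    return p_traj, expected_unsafe


if __name__ == "__main__":
    for gain in (0.2, 0.8):
        p_traj, expected_unsafe = safety_metrics(simulate(gain))
        print(f"gain={gain}: P(trajectory safe)={p_traj:.3f}, "
              f"E[#unsafe steps]={expected_unsafe:.3f}")
```

Under these assumptions the two quantities are linked by a union bound: the probability that the trajectory is safe is at least one minus the expected number of unsafe steps. The bound can be loose, which is one way to see why constraining the cumulative quantity and constraining the trajectory-level probability need not lead to the same policy.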

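This paper studies the problem of learning safe policies in probabilistic-constrained reinforcement learning and proposes two algorithms, Safe Policy Gradient-REINFORCE and SPG-Actor-Critic, together with a Safe Primal-Dual algorithm, to solve it. Experiments verify the effectiveness and superiority of these methods.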