Constrained Reinforcement Learning (CRL) tackles sequential decision-making problems where agents are required to achieve goals by maximizing the expected return while meeting domain-specific constraints, which are often formulated as expected costs. In this setting, policy-based methods are widely used since they come with several advantages when dealing with continuous-control problems. These methods search in the policy space with an action-based or parameter-based exploration strategy, depending on whether they learn directly the parameters of a stochastic policy or those of a stochastic hyperpolicy. In this paper, we propose a general framework for addressing CRL problems via gradient-based primal-dual algorithms, relying on an alternate ascent/descent scheme with dual-variable regularization. We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-iterate convergence guarantees under (weak) gradient domination assumptions, improving and generalizing existing results. Then, we design C-PGAE and C-PGPE, the action-based and the parameter-based versions of C-PG, respectively, and we illustrate how they naturally extend to constraints defined in terms of risk measures over the costs, as it is often requested in safety-critical scenarios. Finally, we numerically validate our algorithms on constrained control problems, and compare them with state-of-the-art baselines, demonstrating their effectiveness.

通过梯度下降的原始对偶算法，我们提出了一种通用框架来解决受限强化学习问题，并介绍了基于动作和参数的版本C-PGAE和C-PGPE。我们在受约束控制问题上进行了数值验证，并与最先进的基准进行了比较，证明了算法的有效性。

约束强化学习的政策梯度最后迭代全局收敛