Constrained reinforcement learning (CRL) is important because it provides a
framework for addressing critical safety concerns in reinforcement learning
(RL). However, to enforce constraint satisfaction, current CRL methods require
either second-order optimization or primal-dual frameworks with additional
Lagrangian multipliers, which increases complexity and makes implementation
inefficient. To address these issues, we
propose a novel first-order feasible method named Constrained Proximal Policy
Optimization (CPPO). By treating the CRL problem as a probabilistic inference
problem, our approach integrates the Expectation-Maximization framework to
solve it through two steps: 1) calculating the optimal policy distribution
within the feasible region (E-step), and 2) conducting a first-order update to
adjust the current policy towards the optimal policy obtained in the E-step
(M-step). We establish the relationship between the probability ratios and KL
divergence to convert the E-step into a convex optimization problem.
Furthermore, we develop an iterative heuristic algorithm from a geometric
perspective to solve this problem. Additionally, we introduce a conservative
update mechanism to overcome the constraint violation issue that occurs in the
existing feasible region method. Empirical evaluations in complex and uncertain
environments validate the effectiveness of the proposed method, which performs
at least as well as the baselines.

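This paper proposes CPPO, a novel first-order feasible method that treats the
constrained reinforcement learning problem as a probabilistic inference
problem. By computing the optimal policy distribution in the E-step and then
applying a first-order update that moves the current policy toward that
optimum, it avoids the complexity and inefficiency of the second-order
optimization or primal-dual frameworks used by existing constrained
reinforcement learning methods. Experiments show that the method performs at
least as well as the baseline methods.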

Constrained Proximal Policy Optimization
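
The E-step/M-step split described in the abstract above can be illustrated on a
toy single-state problem with three actions. This is a minimal sketch under
stated assumptions, not the CPPO algorithm itself: the feasible region (an
expected-cost limit plus a KL trust region around the current policy), the
brute-force grid search that stands in for the paper's convex E-step solver,
and all names (`e_step`, `m_step`, `cost_limit`, `delta`) are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl(q, p):
    """KL(q || p) for strictly positive discrete distributions."""
    return float(np.sum(q * (np.log(q) - np.log(p))))

def e_step(pi, reward, cost, cost_limit, delta, grid=120):
    """Toy E-step: brute-force the best 3-action distribution q inside an
    assumed feasible region (expected cost limit and KL trust region)."""
    best_q, best_val = None, -np.inf
    ticks = np.linspace(1e-3, 1.0, grid)
    for a in ticks:
        for b in ticks:
            c = 1.0 - a - b
            if c < 1e-3:
                continue  # not a valid distribution
            q = np.array([a, b, c])
            if q @ cost > cost_limit or kl(q, pi) > delta:
                continue  # outside the assumed feasible region
            if q @ reward > best_val:
                best_q, best_val = q, q @ reward
    return best_q

def m_step(theta, q, lr=0.5, steps=50):
    """Toy M-step: gradient descent on KL(q || softmax(theta)).
    The gradient with respect to the logits is softmax(theta) - q."""
    for _ in range(steps):
        theta = theta - lr * (softmax(theta) - q)
    return theta

# Hypothetical per-action reward and cost values.
reward = np.array([1.0, 0.5, 0.0])
cost = np.array([1.0, 0.2, 0.0])

theta = np.zeros(3)          # current policy parameters (uniform policy)
pi = softmax(theta)
q = e_step(pi, reward, cost, cost_limit=0.5, delta=0.05)
theta_new = m_step(theta, q)
pi_new = softmax(theta_new)

print("target q:", q)
print("updated policy:", pi_new)
```

The point of the split is that the constraint is handled once, when choosing
the target `q`, so the policy update itself is an ordinary first-order step
with no Lagrange multiplier to tune.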

Nonsmooth composite optimization with orthogonality constraints has a broad
spectrum of applications in statistical learning and data science. However,
this problem is generally challenging to solve due to its non-convex and
non-smooth nature. Existing solutions are limited by one or more of the
following restrictions: (i) they are full gradient methods that require high
computational costs in each iteration; (ii) they are not capable of solving
general nonsmooth composite problems; (iii) they are infeasible methods and can
only achieve the feasibility of the solution at the limit point; (iv) they lack
rigorous convergence guarantees; (v) they only obtain weak optimality of
critical points. In this paper, we propose \textit{\textbf{OBCD}}, a new Block
Coordinate Descent method for solving general nonsmooth composite problems
under Orthogonality constraints. \textit{\textbf{OBCD}} is a feasible method
with a low computational footprint. In each iteration, our algorithm
updates $k$ rows of the solution matrix ($k\geq2$ is a parameter) to preserve
the constraints. Then, it solves a small-sized nonsmooth composite optimization
problem under orthogonality constraints either exactly or approximately. We
demonstrate that any exact block-$k$ stationary point is always an approximate
block-$k$ stationary point, which is equivalent to the critical stationary
point. We are particularly interested in the case where $k=2$ as the resulting
subproblem reduces to a one-dimensional nonconvex problem. We propose a
breakpoint searching method and a fifth-order iterative method to solve this
problem efficiently and effectively. We also propose two novel greedy
strategies to find a good working set to further accelerate the convergence of
\textit{\textbf{OBCD}}. Finally, extensive experiments on several tasks
demonstrate the superiority of our approach.

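This paper proposes OBCD, a new block coordinate descent method for nonsmooth
composite optimization. It solves general nonsmooth composite problems under
orthogonality constraints and is a feasible method with convergence
guarantees.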
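
The $k=2$ case described in the OBCD abstract can be sketched as follows.
Multiplying two rows of a matrix $X$ with $X^\top X = I$ by any $2\times 2$
orthogonal matrix (a rotation or a reflection) keeps $X^\top X = I$, and such a
matrix is parameterized by a single angle, which is why the subproblem is
one-dimensional. This is a minimal sketch, not the authors' implementation: the
example objective (a sparse-PCA-style $-\mathrm{tr}(X^\top A X) + \lambda
\|X\|_1$), the cyclic choice of row pairs, and the plain grid search over the
angle (used here in place of the breakpoint searching method) are all
assumptions made for illustration.

```python
import itertools
import numpy as np

def objective(X, A, lam):
    """Assumed example objective: -tr(X^T A X) + lam * ||X||_1."""
    return -np.trace(X.T @ A @ X) + lam * np.abs(X).sum()

def orthogonal_2x2(theta, reflect):
    """2x2 rotation (reflect=False) or reflection (reflect=True)."""
    c, s = np.cos(theta), np.sin(theta)
    if reflect:
        return np.array([[c, s], [s, -c]])
    return np.array([[c, -s], [s, c]])

def update_two_rows(X, i, j, A, lam, n_grid=180):
    """Grid-search the angle for rows (i, j); keep the best candidate.
    theta = 0 without reflection is the identity, so the objective
    never increases."""
    best_X, best_f = X, objective(X, A, lam)
    for theta in np.linspace(0.0, 2.0 * np.pi, n_grid, endpoint=False):
        for reflect in (False, True):
            V = orthogonal_2x2(theta, reflect)
            cand = X.copy()
            cand[[i, j], :] = V @ X[[i, j], :]  # only two rows change
            f = objective(cand, A, lam)
            if f < best_f:
                best_X, best_f = cand, f
    return best_X, best_f

rng = np.random.default_rng(0)
n, r, lam = 6, 2, 0.1
M = rng.standard_normal((n, n))
A = M @ M.T                                   # symmetric PSD data matrix
X0, _ = np.linalg.qr(rng.standard_normal((n, r)))  # feasible start

X, f0 = X0, objective(X0, A, lam)
f = f0
for _ in range(3):                            # a few cyclic sweeps
    for i, j in itertools.combinations(range(n), 2):
        X, f = update_two_rows(X, i, j, A, lam)

print("initial objective:", f0)
print("final objective:", f)
print("feasibility error:", np.abs(X.T @ X - np.eye(r)).max())
```

Every iterate stays on the constraint set, so no projection or retraction step
is needed. The cyclic sweep is the simplest working-set rule; the greedy
strategies mentioned in the abstract would replace the
`itertools.combinations` loop.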