Recent advances in constrained reinforcement learning (RL) have endowed
reinforcement learning with certain safety guarantees. However, deploying
existing constrained RL algorithms in continuous control tasks with general
hard constraints remains challenging, particularly when the hard constraints
are non-convex. Inspired by the generalized reduced gradient (GRG)
algorithm, a classical constrained optimization technique, we propose a reduced
policy optimization (RPO) algorithm that combines RL with GRG to address
general hard constraints. RPO partitions actions into basic actions and
nonbasic actions following the GRG method and outputs the basic actions via a
policy network. Subsequently, RPO calculates the nonbasic actions by solving
equations based on equality constraints using the obtained basic actions. The
policy network is then updated by implicitly differentiating nonbasic actions
with respect to basic actions. Additionally, we introduce an action projection
procedure based on the reduced gradient and apply a modified Lagrangian
relaxation technique to ensure inequality constraints are satisfied. To the
best of our knowledge, RPO is the first attempt to introduce GRG into RL as a
way of efficiently handling both equality and inequality hard constraints.
Because RL environments with complex hard constraints are currently lacking,
we develop three new benchmarks: two robotic manipulation tasks and a smart
grid operation control task. On these benchmarks, RPO achieves better
performance than previous constrained RL
algorithms in terms of both cumulative reward and constraint violation. We
believe RPO, along with the new benchmarks, will open up new opportunities for
applying RL to real-world problems with complex constraints.
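
The basic/nonbasic split and the implicit differentiation step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the toy equality constraint, the Newton solver, and every function name below (`h`, `dh_db`, `dh_dn`, `solve_nonbasic`, `reduced_gradient`) are assumptions made for illustration. The basic actions `b` stand in for a policy network's output, and the nonbasic action `n` is recovered by solving h(b, n) = 0. The gradient with respect to `b` then follows from the implicit function theorem, dn/db = -(dh/dn)^{-1} dh/db.

```python
import numpy as np

def h(b, n):
    # Hypothetical equality constraint h(b, n) = 0 with two basic actions
    # and one nonbasic action: n^3 + n - b0 * b1 = 0.
    return np.array([n[0] ** 3 + n[0] - b[0] * b[1]])

def dh_db(b, n):
    # Jacobian of h with respect to the basic actions, shape (1, 2).
    return np.array([[-b[1], -b[0]]])

def dh_dn(b, n):
    # Jacobian of h with respect to the nonbasic action, shape (1, 1).
    return np.array([[3.0 * n[0] ** 2 + 1.0]])

def solve_nonbasic(b, tol=1e-12, max_iter=50):
    # Newton's method on h(b, n) = 0 for n, starting from n = 0.
    n = np.zeros(1)
    for _ in range(max_iter):
        r = h(b, n)
        if np.max(np.abs(r)) < tol:
            break
        n = n - np.linalg.solve(dh_dn(b, n), r)
    return n

def reduced_gradient(b, n, df_db, df_dn):
    # Implicit function theorem: dn/db = -(dh/dn)^{-1} dh/db.
    dn_db = -np.linalg.solve(dh_dn(b, n), dh_db(b, n))
    # Total derivative of an objective f(b, n(b)) with respect to b.
    return df_db + dn_db.T @ df_dn

b = np.array([1.0, 2.0])          # stand-in for the policy network output
n = solve_nonbasic(b)             # nonbasic action satisfying h(b, n) = 0
g = reduced_gradient(b, n, np.zeros(2), np.ones(1))  # objective f = n
```

In this sketch the gradient `g` is what would be backpropagated into the policy parameters. Inequality constraints, the action projection step, and the Lagrangian relaxation are deliberately left out.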

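Recent advances in constrained reinforcement learning have provided reinforcement learning with certain safety guarantees. This paper presents reduced policy optimization (RPO), an algorithm that combines RL with GRG to handle continuous control tasks with non-convex hard constraints. RPO partitions actions into basic and nonbasic actions, following the GRG method to generate the basic actions and solving the equality constraints to obtain the nonbasic actions. It also introduces an action projection procedure based on the reduced gradient and applies a modified Lagrangian relaxation technique to ensure that inequality constraints are satisfied. In addition, to address the current lack of environments with complex hard constraints, we develop three new benchmark tasks: two robotic manipulation tasks and a smart grid operation control task. On these benchmarks, RPO achieves better performance than previous constrained RL algorithms in terms of cumulative reward and constraint violation. We believe that RPO and its new benchmarks will open up new opportunities for applying RL to real-world problems with complex constraints.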