Reinforcement Learning (RL) agents are able to solve a wide variety of tasks but are prone to producing unsafe behaviors. Constrained Markov Decision Processes (CMDPs) provide a popular framework for incorporating safety constraints.
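As a sketch of what that framework looks like, assuming the standard discounted formulation (the reward r, cost c, discount γ, and budget d below are conventional notation, not taken from this summary): a CMDP asks for the best policy among those whose expected cumulative cost stays within budget.

```latex
% Standard CMDP objective (notation assumed, not from the source):
% maximize expected discounted reward subject to a bound on
% expected discounted cost.
\max_{\pi} \; \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} \, r(s_t, a_t) \right]
\quad \text{s.t.} \quad
\mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} \, c(s_t, a_t) \right] \le d
```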
This paper considers the relationship between Trust Region Policy Optimization (TRPO), a popular algorithm in reinforcement learning, and the natural trust-region methods of classical convex analysis. It shows that TRPO's adaptive scaling mechanism is in fact the RL version of the traditional trust-region method, and, for regularized MDPs, it establishes fast convergence rates, the first improved rates in RL when the instantaneous costs or rewards are regularized.
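A minimal sketch of the trust-region view in question, assuming the usual mirror-descent-style notation (the step size t_k and the linearization below are illustrative, not quoted from the paper): each iteration maximizes a linearized value penalized by a KL proximity term, and adapting t_k across iterations plays the role of the classical trust-region radius.

```latex
% Proximal (trust-region) form of a TRPO-style step (notation assumed):
% linearize the value around the current policy \pi_k and penalize
% movement with a KL term weighted by an adaptive step size t_k.
\pi_{k+1} \in \arg\max_{\pi} \;
\left\langle \nabla_{\pi} V^{\pi_k}, \; \pi - \pi_k \right\rangle
\;-\; \frac{1}{t_k} \, \mathrm{KL}\!\left( \pi \,\middle\|\, \pi_k \right)
```

With regularized (e.g. entropy-augmented) instantaneous costs or rewards, the proximal objective gains strong convexity, which is the standard mechanism behind faster rates in analyses of this kind.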