Reinforcement Learning (RL) agents are able to solve a wide variety of tasks but are prone to producing unsafe behaviors. Constrained Markov Decision Processes (CMDPs) provide a popular framework for incorporating safety constraints.
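As a sketch of what that framework looks like, assuming the standard discounted formulation (the reward r, cost c, discount γ, and budget d below are conventional notation, not taken from this summary): a CMDP asks for the best policy among those whose expected cumulative cost stays within budget.

```latex
% Standard CMDP objective (notation assumed, not from the source):
% maximize expected discounted reward subject to a bound on
% expected discounted cost.
\max_{\pi} \; \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} \, r(s_t, a_t) \right]
\quad \text{s.t.} \quad
\mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} \, c(s_t, a_t) \right] \le d
```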
This paper considers the relationship between Trust Region Policy Optimization (TRPO), a popular algorithm in reinforcement learning, and the natural trust-region methods of classical convex analysis. It shows that TRPO's adaptive scaling mechanism is in fact the RL version of the traditional trust-region method, and, for regularized MDPs, it establishes fast convergence rates, the first improved rates in RL when the instantaneous costs or rewards are regularized.
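A minimal sketch of the trust-region view in question, assuming the usual mirror-descent-style notation (the step size t_k and the linearization below are illustrative, not quoted from the paper): each iteration maximizes a linearized value penalized by a KL proximity term, and adapting t_k across iterations plays the role of the classical trust-region radius.

```latex
% Proximal (trust-region) form of a TRPO-style step (notation assumed):
% linearize the value around the current policy \pi_k and penalize
% movement with a KL term weighted by an adaptive step size t_k.
\pi_{k+1} \in \arg\max_{\pi} \;
\left\langle \nabla_{\pi} V^{\pi_k}, \; \pi - \pi_k \right\rangle
\;-\; \frac{1}{t_k} \, \mathrm{KL}\!\left( \pi \,\middle\|\, \pi_k \right)
```

With regularized (e.g. entropy-augmented) instantaneous costs or rewards, the proximal objective gains strong convexity, which is the standard mechanism behind faster rates in analyses of this kind.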