BriefGPT.xyz
Nov, 2024
Embedding Safety into RL: A New Take on Trust Region Methods
Nikola Milosevic, Johannes Müller, Nico Scherf
TL;DR
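This work addresses unsafe behavior in reinforcement learning by proposing a new method, Constrained Trust Region Policy Optimization (C-TRPO), which reshapes the geometry of the policy space according to the safety constraints so that the constraints are satisfied throughout training. Experiments show that C-TRPO significantly reduces constraint violations while remaining competitive in reward maximization with state-of-the-art constrained Markov decision process algorithms.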
Abstract
Reinforcement Learning (RL) agents are able to solve a wide variety of tasks but are prone to producing unsafe behaviors. Constrained Markov Decision Processes (CMDPs) provide a popular framework for incorporating safety…