Two main challenges in Reinforcement Learning (RL) are designing appropriate reward functions and ensuring the safety of the learned policy. To address these challenges, we present a theoretical framework for Inverse Reinforcement Learning (IRL) in constrained Markov decision processes. From a convex-analytic perspective, we extend prior results on reward identifiability and generalizability to both the constrained setting and a more general class of regularizations. In particular, we show that identifiability up to potential shaping (Cao et al., 2021) is a consequence of entropy regularization and may generally no longer hold for other regularizations or in the presence of safety constraints. We also show that to ensure generalizability to new transition laws and constraints, the true reward must be identified up to a constant. Additionally, we derive a finite sample guarantee for the suboptimality of the learned rewards, and validate our results in a gridworld environment.
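The abstract's claim about identifiability up to potential shaping can be illustrated numerically. The sketch below is not the authors' code; it assumes a small random tabular MDP without constraints, the standard shaping form r'(s,a) = r(s,a) + γ E[φ(s')] − φ(s), and entropy-regularized (soft) value iteration. All names (`soft_value_iteration`, `phi`, `r_shaped`) are hypothetical and chosen for illustration. Under these assumptions the shaped and unshaped rewards induce the same soft-optimal policy, which is why observed behavior alone cannot tell them apart.

```python
import numpy as np

def soft_value_iteration(P, r, gamma, iters=500):
    """Entropy-regularized (soft) value iteration on a tabular MDP.

    P: transition tensor of shape (S, A, S); r: reward of shape (S, A).
    Returns the soft Q-function and the softmax policy it induces.
    """
    S, A = r.shape
    q = np.zeros((S, A))
    for _ in range(iters):
        m = q.max(axis=1)
        # Numerically stable log-sum-exp over actions gives the soft value.
        v = m + np.log(np.exp(q - m[:, None]).sum(axis=1))
        q = r + gamma * (P @ v)
    m = q.max(axis=1, keepdims=True)
    pi = np.exp(q - m)
    pi /= pi.sum(axis=1, keepdims=True)
    return q, pi

# Hypothetical random MDP (illustration only, not from the paper).
rng = np.random.default_rng(0)
S, A, gamma = 4, 3, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)   # normalize rows into distributions
r = rng.random((S, A))
phi = rng.normal(size=S)            # arbitrary potential function

# Potential shaping: r'(s,a) = r(s,a) + gamma * E[phi(s')] - phi(s).
r_shaped = r + gamma * (P @ phi) - phi[:, None]

q, pi = soft_value_iteration(P, r, gamma)
q_shaped, pi_shaped = soft_value_iteration(P, r_shaped, gamma)

same_policy = np.allclose(pi, pi_shaped, atol=1e-8)
q_shift_is_phi = np.allclose(q - q_shaped, phi[:, None], atol=1e-8)
print(same_policy, q_shift_is_phi)
```

In this unconstrained, entropy-regularized setting the soft Q-functions differ only by the state-dependent shift φ(s), which cancels in the softmax. The abstract states that this equivalence may no longer hold for other regularizations or under safety constraints; the sketch does not cover those cases.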

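Summary: The study proposes a theoretical framework for inverse reinforcement learning that addresses two major challenges: designing appropriate reward functions and ensuring the safety of the learned policy. From a convex-analytic perspective, the paper extends prior work on reward identifiability and generalizability, and shows that in constrained Markov decision processes the true reward must be identified up to a constant to ensure generalization to new transition laws and constraints. Finally, the paper validates its theoretical results in a gridworld environment.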

Identifiability and Generalizability in Constrained Inverse Reinforcement Learning