Inverse reinforcement learning attempts to reconstruct the reward function in a Markov decision problem from observations of an agent's actions. As already observed by Russell, the problem is ill-posed and the reward function is not identifiable, even in the presence of perfect information about optimal behavior. We provide a resolution to this non-identifiability for problems with entropy regularization. For a given environment, we fully characterize the reward functions leading to a given policy and demonstrate that, given demonstrations of actions for the same reward under two distinct discount factors, or under sufficiently different environments, the unobserved reward can be recovered up to a constant. A simple numerical experiment demonstrates that the proposed resolution reconstructs the reward function accurately.
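The two-discount-factor claim can be illustrated with a minimal sketch. It is not the authors' experiment. It assumes the standard soft-Bellman formulation, in which an entropy-regularized optimal policy satisfies r(s,a) = lam * log pi(a|s) + v(s) - gamma * sum_s' P(s'|s,a) v(s') for the soft value function v. Observing the optimal policies under two discount factors then gives a linear system in the two unknown value functions. The function names, the random toy environment, and the parameters (5 states, 3 actions, lam = 1) are all hypothetical, and the sketch further assumes that transitions depend on the action in a generic way.

```python
import numpy as np

def soft_value(q, lam):
    """Soft maximum over actions: lam * log(sum_a exp(q / lam)), computed stably."""
    m = q.max(axis=1)
    return m + lam * np.log(np.exp((q - m[:, None]) / lam).sum(axis=1))

def soft_optimal_log_policy(r, P, gamma, lam=1.0, iters=3000):
    """Soft value iteration; returns log pi(a|s) of the regularized optimal policy."""
    v = np.zeros(r.shape[0])
    for _ in range(iters):
        v = soft_value(r + gamma * (P @ v), lam)   # P has shape (S, A, S)
    q = r + gamma * (P @ v)
    return (q - soft_value(q, lam)[:, None]) / lam

def recover_reward(logpi1, logpi2, P, gamma1, gamma2, lam=1.0):
    """Solve A1 v1 - A2 v2 = lam * (log pi2 - log pi1) for the value functions,
    where A_i = E - gamma_i * P, then rebuild the reward from (pi1, v1)."""
    n_s, n_a = logpi1.shape
    E = np.repeat(np.eye(n_s), n_a, axis=0)        # row (s, a) selects v(s)
    P_flat = P.reshape(n_s * n_a, n_s)             # row (s, a) holds P(. | s, a)
    A1 = E - gamma1 * P_flat
    A2 = E - gamma2 * P_flat
    M = np.hstack([A1, -A2])
    b = lam * (logpi2 - logpi1).reshape(-1)
    x, *_ = np.linalg.lstsq(M, b, rcond=None)      # minimum-norm solution
    v1 = x[:n_s]
    return lam * logpi1 + (A1 @ v1).reshape(n_s, n_a)

# Hypothetical toy environment: random action-dependent transitions and rewards.
rng = np.random.default_rng(0)
n_states, n_actions = 5, 3
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
r_true = rng.normal(size=(n_states, n_actions))

logpi_a = soft_optimal_log_policy(r_true, P, gamma=0.9)
logpi_b = soft_optimal_log_policy(r_true, P, gamma=0.5)
r_hat = recover_reward(logpi_a, logpi_b, P, 0.9, 0.5)

# If the reward is recovered up to a constant, this spread is numerically zero.
diff = r_hat - r_true
spread = diff.max() - diff.min()
print("spread of (r_hat - r_true):", spread)
```

The least-squares solve returns one particular pair of value functions. The remaining freedom in this sketch corresponds to adding a constant to the reward, which is why the check looks at the spread of the difference rather than the difference itself.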

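By using entropy regularization, we resolve the non-identifiability of the reward function in Markov decision problems and fully characterize the reward functions that lead to a given policy in a given environment. We also show that, given demonstrations of actions for the same reward under different discount factors or under sufficiently different environments, the unobserved reward can be recovered up to a constant. In addition, we provide general necessary and sufficient conditions for reconstructing time-homogeneous rewards on finite horizons and for reconstructing action-independent rewards.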

Identifiability in inverse reinforcement learning