Reinforcement learning has seen recent progress on safety and on high-confidence bounds on policy performance. However, to our knowledge, no methods exist for determining high-confidence safety bounds for a given evaluation policy in the inverse reinforcement learning setting, where the true reward function is unknown and only samples of expert behavior are given. We propose a method based on Bayesian Inverse Reinforcement Learning that uses demonstrations to determine practical high-confidence bounds on the difference in expected return between any evaluation policy and the expert's underlying policy. A sampling-based approach is used to obtain probabilistic confidence bounds using the Value at Risk metric from finance. We empirically evaluate our proposed bound on a standard navigation task for a wide variety of ground-truth reward functions. Empirical results demonstrate that our proposed bound significantly improves on a standard feature-count-based approach, providing accurate, tight bounds even for small numbers of noisy demonstrations.
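To make the sampling-based idea concrete, the following is a minimal sketch rather than the paper's implementation. It assumes that reward functions have already been sampled from a Bayesian IRL posterior and that, for each sample, the expected return of an optimal policy and of the evaluation policy have been computed elsewhere. The function names `expected_value_differences` and `value_at_risk`, the toy numbers, and the use of a plain empirical quantile as the Value at Risk estimate are all illustrative assumptions.

```python
import numpy as np


def expected_value_differences(optimal_returns, eval_returns):
    """Per-sample gap between the optimal and the evaluation policy's return.

    Hypothetical inputs: entry i of each array is an expected return under
    the i-th reward function sampled from the posterior.
    """
    optimal = np.asarray(optimal_returns, dtype=float)
    evaluated = np.asarray(eval_returns, dtype=float)
    return optimal - evaluated


def value_at_risk(losses, alpha=0.95):
    """Empirical alpha-quantile of the losses (smallest order statistic
    that covers at least an alpha fraction of the samples)."""
    ordered = np.sort(np.asarray(losses, dtype=float))
    k = int(np.ceil(alpha * ordered.size)) - 1
    k = min(max(k, 0), ordered.size - 1)  # keep the index in range
    return float(ordered[k])


# Toy numbers standing in for five posterior reward samples (made up).
optimal = [1.0, 0.9, 1.2, 0.8, 1.1]
evaluated = [0.9, 0.9, 0.7, 0.6, 1.0]

gaps = expected_value_differences(optimal, evaluated)
bound = value_at_risk(gaps, alpha=0.8)
print(f"Estimated 0.8-VaR of the return gap: {bound:.3f}")
```

Under these assumptions, `bound` is read as a performance gap that the evaluation policy stays within for roughly an `alpha` fraction of the sampled reward functions. A smaller value suggests a safer policy with respect to the posterior.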

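This paper proposes a Bayesian sampling-based method for determining practical high-confidence bounds on policy performance in the inverse reinforcement learning setting, and demonstrates how these bounds can be used for risk-aware policy selection and improvement.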

Efficient Probabilistic Performance Bounds for Inverse Reinforcement Learning