We consider a Bayesian approach to offline model-based inverse reinforcement
learning (IRL). The proposed framework differs from existing offline
model-based IRL approaches by performing simultaneous estimation of the
expert's reward function and subjective model of environment dynamics. Using a
class of prior distributions that parameterizes how accurate the expert's model
of the environment is, we develop efficient algorithms to estimate the expert's
reward and subjective dynamics in high-dimensional settings. Our analysis
reveals that the estimated policy exhibits robust performance when the expert
is believed (a priori) to have a
highly accurate model of the environment. We verify this observation in
MuJoCo environments and show that our algorithms outperform state-of-the-art
offline IRL algorithms.
