This work aims to tackle a major challenge in offline Inverse Reinforcement Learning (IRL), namely the reward extrapolation error, where the learned reward function may fail to explain the task correctly and misguide the agent in unseen environments due to the intrinsic covariate shift. Leveraging both expert data and lower-quality diverse data, we devise a principled algorithm (CLARE) that solves offline IRL efficiently by integrating "conservatism" into a learned reward function and utilizing an estimated dynamics model. Our theoretical analysis provides an upper bound on the return gap between the learned policy and the expert policy, based on which we characterize the impact of covariate shift by examining subtle two-tier tradeoffs between exploitation (on both expert and diverse data) and exploration (on the estimated dynamics model). We show that CLARE can provably alleviate the reward extrapolation error by striking the right exploitation-exploration balance therein. Extensive experiments corroborate the significant performance gains of CLARE over existing state-of-the-art algorithms on MuJoCo continuous control tasks (especially with a small offline dataset), and the learned reward is highly instructive for further learning.
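To make the notion of a "conservative" reward concrete, the following is a toy sketch and not the objective used by CLARE. It assumes a linear reward r(s, a) = w · φ(s, a) over hypothetical feature vectors, raises the reward on offline data, and lowers it on samples drawn from an estimated dynamics model. The function name `conservative_reward_step` and the parameters `beta`, `lr`, and `l2` are illustrative assumptions.

```python
import numpy as np

def conservative_reward_step(w, phi_data, phi_model, beta=1.0, lr=0.1, l2=0.01):
    """One gradient-ascent step on a toy conservative reward objective.

    Objective (an assumption for illustration, not the paper's):
        mean_data[r] - beta * mean_model[r] - (l2 / 2) * ||w||^2
    with a linear reward r(s, a) = w @ phi(s, a).
    """
    grad = phi_data.mean(axis=0) - beta * phi_model.mean(axis=0) - l2 * w
    return w + lr * grad

# Synthetic features standing in for offline data and model rollouts.
rng = np.random.default_rng(0)
phi_data = rng.normal(loc=1.0, size=(256, 4))    # features of offline (s, a) pairs
phi_model = rng.normal(loc=-1.0, size=(256, 4))  # features of model-generated pairs

w = np.zeros(4)
for _ in range(100):
    w = conservative_reward_step(w, phi_data, phi_model, beta=1.0)

reward_on_data = float((phi_data @ w).mean())
reward_on_model = float((phi_model @ w).mean())
print(reward_on_data > reward_on_model)
```

In this sketch, `beta` plays the role of a conservatism knob: larger values push the reward down more strongly on state-action pairs that appear only in model rollouts, which is one simple way to trade off exploitation of the data against exploration on the model.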

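This paper proposes an algorithm named CLARE, which addresses the reward extrapolation error in offline inverse reinforcement learning by incorporating "conservatism" into the learned reward function and using an estimated dynamics model. The learned reward function is highly instructive for subsequent learning, and extensive experiments demonstrate clear performance gains for CLARE over existing state-of-the-art algorithms on MuJoCo continuous control tasks.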

CLARE: Conservative Model-Based Reward Learning for Offline Inverse Reinforcement Learning