We present an algorithm for Inverse Reinforcement Learning (IRL) from expert state observations only. Our approach decouples reward modelling from policy learning, unlike state-of-the-art adversarial methods which require updating the reward model during policy search and are known to be unstable and difficult to optimize. Our method, IL-flOw, recovers the expert policy by modelling state-state transitions, by generating rewards using deep density estimators trained on the demonstration trajectories, avoiding the instability issues of adversarial methods. We demonstrate that using the state transition log-probability density as a reward signal for forward reinforcement learning translates to matching the trajectory distribution of the expert demonstrations, and experimentally show good recovery of the true reward signal as well as state of the art results for imitation from observation on locomotion and robotic continuous control tasks.

本论文介绍了一种基于状态观测的逆强化学习算法IL-flOw，其将奖励建模与策略学习解耦，并利用深度密度估计方法生成奖励信号，避免了对抗训练方法的不稳定性问题。通过使用状态转移概率密度作为正向强化学习的奖励信号，实验结果展示了在大规模机器人控制任务上的优秀表现。

基于归一化流的观测式模仿学习(IL-flOw)