We consider the problem of imitation learning from a finite set of expert trajectories, without access to reinforcement signals. The classical approach of extracting the expert's reward function via inverse reinforcement learning, followed by reinforcement learning, is indirect and may be computationally expensive. Recent generative adversarial methods based on matching the policy distribution between the expert and the agent can be unstable during training. We propose a new framework for imitation learning that estimates the support of the expert policy to compute a fixed reward function, which allows us to re-frame imitation learning within the standard reinforcement learning setting. We demonstrate the efficacy of our reward function on both discrete and continuous domains, achieving performance comparable to or better than the state of the art under different reinforcement learning algorithms.

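This paper proposes a new imitation learning framework that estimates the support of the expert policy to compute a fixed reward function, re-framing imitation learning within the standard reinforcement learning setting. It demonstrates the efficacy of this reward function on both discrete and continuous domains, achieving performance comparable to or better than the state of the art under different reinforcement learning algorithms.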
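To make the idea of a fixed, support-based reward concrete, the following is a minimal sketch and not the authors' implementation. It assumes one possible support estimator: a predictor is fitted to reproduce the outputs of a fixed, randomly initialised target function on expert state-action pairs only, and the reward is taken to be `exp(-sigma * error)`, so it is close to 1 where the predictor was trained (on the expert's support) and decays elsewhere. The network sizes, the ridge-regression predictor, the reward form, `sigma`, and the synthetic "expert" data are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical sizes: concatenated (state, action) input, target output, predictor width.
DIM, OUT, HIDDEN = 4, 8, 64

# Fixed, randomly initialised target function (never trained).
W_t = rng.normal(size=(DIM, OUT))
b_t = rng.normal(size=OUT)

def target(x):
    return np.tanh(x @ W_t + b_t)

# Predictor: fixed random tanh features plus a bias column, with a trainable linear readout.
W_p = rng.normal(size=(DIM, HIDDEN))
b_p = rng.normal(size=HIDDEN)

def features(x):
    h = np.tanh(x @ W_p + b_p)
    return np.hstack([h, np.ones((x.shape[0], 1))])

def fit_predictor(expert_x, ridge=1e-3):
    """Fit the readout by ridge regression on expert pairs only."""
    phi = features(expert_x)
    gram = phi.T @ phi + ridge * np.eye(phi.shape[1])
    return np.linalg.solve(gram, phi.T @ target(expert_x))

def reward(x, theta, sigma=1.0):
    """Assumed reward form: exp(-sigma * squared prediction error)."""
    err = np.sum((features(x) @ theta - target(x)) ** 2, axis=1)
    return np.exp(-sigma * err)

# Synthetic stand-ins for expert data and for pairs far from the expert's support.
expert_x = rng.normal(0.0, 0.5, size=(500, DIM))
off_x = rng.normal(3.0, 0.5, size=(500, DIM))

theta = fit_predictor(expert_x)
expert_reward = reward(expert_x, theta)
off_reward = reward(off_x, theta)
print("mean reward on expert pairs:     ", expert_reward.mean())
print("mean reward on off-support pairs:", off_reward.mean())
```

Once `theta` is fitted, the reward no longer changes, so under these assumptions any standard reinforcement learning algorithm could be run against `reward` in place of an environment reward.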

Random Expert Distillation: Imitation Learning via Expert Policy Support Estimation