We consider the problem of learning a reward and a policy from expert examples under unknown dynamics in high-dimensional settings. Our proposed method builds on the framework of generative adversarial networks and exploits reward shaping to learn near-optimal rewards and policies. Potential-based reward shaping functions are known to guide the learning agent; in this paper we show that they also help in learning near-optimal rewards. Our method learns a potential-based reward shaping function through variational information maximization, simultaneously with the reward and the policy, under the adversarial learning formulation. We evaluate our method on various high-dimensional complex control tasks. We also evaluate the learned rewards in transfer learning problems in which the training and testing environments differ in dynamics or structure. Our experiments show that the proposed method not only learns near-optimal rewards and policies that match expert behavior, but also performs significantly better than state-of-the-art inverse reinforcement learning algorithms.
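
As a small illustration of the potential-based reward shaping that the abstract refers to, the sketch below adds the standard shaping term gamma * Phi(s') - Phi(s) to a reward and checks that, along a trajectory, the shaped return differs from the original return only by a term that depends on the first and last states. Everything here is an assumption made for illustration: the potential `phi` is hand-picked rather than learned through variational information maximization, and the states, rewards, and function names are hypothetical and not taken from the paper.

```python
def shaped_reward(reward, phi_s, phi_next, gamma):
    """Potential-based shaping: r + gamma * Phi(s') - Phi(s)."""
    return reward + gamma * phi_next - phi_s


def discounted_return(rewards, gamma):
    """Discounted sum of a reward sequence, starting at t = 0."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def phi(state):
    # Hypothetical hand-picked potential: negative distance to goal state 3.
    # In the paper the potential is learned; this stand-in is an assumption.
    return -abs(3 - state)


# Toy trajectory on a line: states visited and rewards for each transition.
states = [0, 1, 2, 3]
rewards = [0.0, 0.0, 1.0]
gamma = 0.9

shaped = [
    shaped_reward(r, phi(s), phi(s_next), gamma)
    for r, s, s_next in zip(rewards, states[:-1], states[1:])
]

plain_return = discounted_return(rewards, gamma)
shaped_return = discounted_return(shaped, gamma)

# The shaping terms telescope, leaving only the boundary potentials.
offset = gamma ** len(rewards) * phi(states[-1]) - phi(states[0])
```

Because the shaping terms telescope, `shaped_return - plain_return` equals `offset` for any trajectory with the same start and end states, which is the usual argument for why shaping of this form changes how quickly an agent learns without changing which policy is optimal.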

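Using the generative adversarial network framework, we propose empowerment-regularized maximum-entropy inverse reinforcement learning to learn near-optimal rewards and policies, with the empowerment learned simultaneously through variational information maximization. We evaluate the method on various high-dimensional complex control tasks and on challenging transfer learning problems, and show that it not only matches expert behavior but also performs significantly better than state-of-the-art inverse reinforcement learning algorithms.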

Adversarial Imitation via Variational Inverse Reinforcement Learning