Learning to perform tasks by leveraging a dataset of expert observations,
also known as imitation learning from observations (ILO), is an important
paradigm for learning skills without access to the expert reward function or
the expert actions. We consider ILO in the setting where the expert and the
learner agents operate in different environments, with the source of the
discrepancy being the transition dynamics model. Recent methods for scalable
ILO utilize adversarial learning to match the state-transition distributions of
the expert and the learner, an approach that becomes challenging when the
dynamics are dissimilar. In this work, we propose an algorithm that trains an
intermediary policy in the learner environment and uses it as a surrogate
expert for the learner. The intermediary policy is learned such that the state
transitions generated by it are close to the state transitions in the expert
dataset. To derive a practical and scalable algorithm, we employ concepts from
prior work on estimating the support of a probability distribution. Experiments
using MuJoCo locomotion tasks highlight that our method compares favorably to
the baselines for ILO with transition dynamics mismatch.
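
The abstract does not say which support-estimation technique is used, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a simple nearest-neighbor kernel score as a stand-in support estimator: a state transition (s, s') scores close to 1 when it lies near some transition in the expert dataset and close to 0 when it lies far from all of them. Such a score could serve as a reward when training the intermediary policy in the learner environment. The names `support_score`, `Transition`, and `bandwidth` are hypothetical and introduced purely for illustration.

```python
import math
from typing import List, Sequence, Tuple

# A state transition (state, next_state); hypothetical type for illustration.
Transition = Tuple[Sequence[float], Sequence[float]]


def _flatten(transition: Transition) -> List[float]:
    """Concatenate state and next_state into a single feature vector."""
    state, next_state = transition
    return list(state) + list(next_state)


def support_score(
    transition: Transition,
    expert_transitions: Sequence[Transition],
    bandwidth: float = 1.0,
) -> float:
    """Score in (0, 1] that is high when `transition` is near the expert data.

    Assumption: a Gaussian kernel on the squared distance to the nearest
    expert transition stands in for a learned support estimator.
    """
    if not expert_transitions:
        raise ValueError("expert_transitions must be non-empty")
    query = _flatten(transition)
    min_sq_dist = min(
        sum((q - e) ** 2 for q, e in zip(query, _flatten(expert)))
        for expert in expert_transitions
    )
    return math.exp(-min_sq_dist / (bandwidth ** 2))


# Toy expert dataset: two transitions moving along the first coordinate.
expert = [((0.0, 0.0), (1.0, 0.0)), ((1.0, 0.0), (2.0, 0.0))]

# A transition that appears in the expert data receives the maximum score.
on_support = support_score(((0.0, 0.0), (1.0, 0.0)), expert)

# A transition far from every expert transition receives a score near zero.
off_support = support_score(((0.0, 0.0), (0.0, 5.0)), expert)
```

In this sketch the score depends only on observed state transitions, never on actions, which matches the ILO setting where expert actions are unavailable. A practical system would replace the exhaustive nearest-neighbor search with a learned estimator that scales to large datasets.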

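This paper proposes a method that trains the learner with the help of an intermediary policy. The intermediary policy approximately reproduces the expert's behavior, which enables imitation learning when the expert and the learner operate in different environments, and the method achieves good results on MuJoCo locomotion tasks.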