Imitation Learning (IL) is a popular paradigm for training agents to achieve
complicated goals by leveraging expert behavior, rather than dealing with the
hardships of designing a correct reward function. With the environment modeled
as a Markov Decision Process (MDP), most of the existing IL algorithms are
contingent on the availability of expert demonstrations in the same MDP as the
one in which a new imitator policy is to be learned. This is uncharacteristic
of many real-life scenarios where discrepancies between the expert and the
imitator MDPs are common, especially in the transition dynamics function.
Furthermore, obtaining expert actions may be costly or infeasible, making the
recent trend towards state-only IL (where expert demonstrations consist only of
states or observations) especially promising. Building on recent adversarial
imitation approaches that are motivated by the idea of divergence minimization,
we present a new state-only IL algorithm in this paper. It divides the overall
optimization objective into two subproblems by introducing an indirection step
and solves the subproblems iteratively. We show that our algorithm is
particularly effective when there is a transition dynamics mismatch between the
expert and imitator MDPs, while the baseline IL methods suffer from performance
degradation. To analyze this, we construct several interesting MDPs by
modifying the configuration parameters for the MuJoCo locomotion tasks from
OpenAI Gym.
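
To make the setting concrete, the following is a minimal sketch of how state-only data and a dynamics mismatch interact. It is not the algorithm from the paper. Everything in it is an assumption made for illustration: a 1-D point mass stands in for a MuJoCo task, the `friction` argument stands in for a modified configuration parameter, the discriminator is a plain logistic regression over state pairs (s, s'), and the reward uses the common GAIL-style form -log(1 - D).

```python
import numpy as np

def rollout(friction, n_steps=50, dt=0.1):
    """Simulate a 1-D point mass under a constant unit force.

    `friction` is a hypothetical stand-in for a simulator configuration
    parameter. Only states (position, velocity) are returned, no actions.
    """
    pos, vel = 0.0, 0.0
    states = [(pos, vel)]
    for _ in range(n_steps):
        vel += (1.0 - friction * vel) * dt
        pos += vel * dt
        states.append((pos, vel))
    return np.array(states)

def to_pairs(states):
    """Stack consecutive states into (s, s') rows, the state-only data unit."""
    return np.hstack([states[:-1], states[1:]])

def fit_discriminator(expert_pairs, imitator_pairs, lr=0.1, n_iters=2000):
    """Fit a logistic regression D(s, s') that outputs 1 for expert pairs."""
    x = np.vstack([expert_pairs, imitator_pairs])
    y = np.concatenate([np.ones(len(expert_pairs)),
                        np.zeros(len(imitator_pairs))])
    mean, std = x.mean(axis=0), x.std(axis=0) + 1e-8
    xn = (x - mean) / std  # standardize features for stable gradient steps
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-(xn @ w + b)))
        w -= lr * xn.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return lambda pairs: 1.0 / (1.0 + np.exp(-(((pairs - mean) / std) @ w + b)))

# Same driving force, different dynamics: the state sequences diverge.
expert_pairs = to_pairs(rollout(friction=0.1))    # expert MDP
imitator_pairs = to_pairs(rollout(friction=1.0))  # imitator MDP (mismatch)

discriminator = fit_discriminator(expert_pairs, imitator_pairs)
# GAIL-style surrogate reward for the imitator's own transitions.
rewards = -np.log(1.0 - discriminator(imitator_pairs) + 1e-8)
```

In this toy, the discriminator separates the two datasets purely because the dynamics differ, even though the behavior is identical. This is the kind of situation in which a reward derived from direct state matching can become uninformative for the imitator.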

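This paper presents a new state-only imitation learning algorithm. Building on recent adversarial imitation approaches, it decomposes the overall optimization objective into two subproblems and solves them iteratively, addressing the transition dynamics mismatch between the expert and imitator MDPs. The authors also construct several MDPs from the MuJoCo locomotion tasks in OpenAI Gym. Their analysis shows that the algorithm is particularly effective under transition dynamics mismatch, whereas baseline IL methods suffer performance degradation.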

State-only Imitation with Transition Dynamics Mismatch

We show that a critical vulnerability in adversarial imitation is the
tendency of discriminator networks to learn spurious associations between
visual features and expert labels. When the discriminator focuses on
task-irrelevant features, it does not provide an informative reward signal,
leading to poor task performance. We analyze this problem in detail and propose
a solution that outperforms standard Generative Adversarial Imitation Learning
(GAIL). Our proposed method, Task-Relevant Adversarial Imitation Learning
(TRAIL), uses constrained discriminator optimization to learn informative
rewards. In comprehensive experiments, we show that TRAIL can solve challenging
robotic manipulation tasks from pixels by imitating human operators without
access to any task rewards, and clearly outperforms comparable baseline
imitation agents, including those trained via behaviour cloning and
conventional GAIL.
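
The abstract does not spell out the form of the constraint, so the sketch below is only an illustration of the general idea of constraining a discriminator, not TRAIL itself. It assumes a two-feature toy problem (one task-relevant feature, one spurious feature), a hypothetical "task-irrelevant" dataset in which expert and agent differ only in the spurious feature, and a quadratic penalty that discourages the discriminator from telling the two apart on that dataset. The names `sample`, `fit`, and `penalty_weight` are invented for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(task_mean, spurious_mean, n=200):
    """Draw 2-D features: column 0 is task-relevant, column 1 is spurious."""
    return np.column_stack([
        rng.normal(task_mean, 0.5, n),
        rng.normal(spurious_mean, 0.5, n),
    ])

# Training data: expert and agent differ in both features.
expert, agent = sample(1.0, 1.0), sample(-1.0, -1.0)
# Hypothetical task-irrelevant data: same task feature, different spurious one.
expert_irrel, agent_irrel = sample(0.0, 1.0), sample(0.0, -1.0)

def fit(penalty_weight, lr=0.1, n_iters=2000):
    """Logistic discriminator with an optional penalty on the irrelevant set.

    The penalty is the squared gap between the mean logits of the expert and
    agent samples in the task-irrelevant set (an assumed surrogate constraint).
    """
    x = np.vstack([expert, agent])
    y = np.concatenate([np.ones(len(expert)), np.zeros(len(agent))])
    delta = expert_irrel.mean(axis=0) - agent_irrel.mean(axis=0)
    w, b = np.zeros(2), 0.0
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        grad_w = x.T @ (p - y) / len(y)                        # cross-entropy term
        grad_w += penalty_weight * 2.0 * (w @ delta) * delta   # penalty term
        w -= lr * grad_w
        b -= lr * np.mean(p - y)
    return w

w_plain = fit(penalty_weight=0.0)        # free to exploit the spurious feature
w_constrained = fit(penalty_weight=1.0)  # pushed towards the task feature
```

Comparing `w_plain[1]` with `w_constrained[1]` shows how much weight each discriminator places on the spurious feature. Under these assumptions, the penalized version should rely mainly on the task-relevant feature, so a reward derived from it tracks task progress rather than incidental appearance.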

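This paper examines a critical vulnerability in adversarial imitation, namely the tendency of discriminator networks to learn spurious associations between visual features and expert labels, and proposes a new solution, TRAIL, which uses constrained discriminator optimization to learn informative rewards. In experiments, TRAIL solves challenging robotic manipulation tasks by imitating human operators without access to any task rewards, and clearly outperforms comparable baseline imitation agents, including those trained via behaviour cloning and conventional GAIL.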