We identify two issues with the family of algorithms based on the Adversarial
Imitation Learning framework. The first problem is implicit bias present in the
reward functions used in these algorithms. While these biases might work well
for some environments, they can also lead to sub-optimal behavior in others.
Second, although these algorithms can learn from few expert
demonstrations, the number of environment interactions they need in order to
imitate the expert is prohibitively large for many real-world
applications. To address these issues, we propose a new algorithm
called Discriminator-Actor-Critic that uses off-policy Reinforcement Learning
to reduce policy-environment interaction sample complexity by an average factor
of 10. Furthermore, since our reward function is designed to be unbiased, we
can apply our algorithm to many problems without making any task-specific
adjustments.

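We propose a new algorithm, Discriminator-Actor-Critic, to address two problems with methods based on the Adversarial Imitation Learning framework: implicit bias, and the large number of environment interactions required. The algorithm uses off-policy Reinforcement Learning to reduce policy-environment interaction sample complexity, and because our reward function is designed to be unbiased, it can be applied to many problems without any task-specific adjustments.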
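
To make the notion of implicit reward bias concrete, the following is a minimal sketch and not the algorithm described above. It assumes two reward forms that are commonly derived from a discriminator output D in adversarial imitation learning, -log(1 - D) and log(D), with D read as the probability that a state-action pair came from the expert; neither the forms nor that convention are stated in the text above. The helper names are hypothetical. The point is only that a reward with a fixed sign makes the return depend on episode length, independent of how well the policy imitates.

```python
import math

def reward_positive(d):
    # Assumed form: -log(1 - D), strictly positive for D in (0, 1).
    return -math.log(1.0 - d)

def reward_negative(d):
    # Assumed form: log(D), strictly negative for D in (0, 1).
    return math.log(d)

def episode_return(reward_fn, d, steps):
    # Undiscounted return when the discriminator outputs the same
    # value d at every step of an episode lasting `steps` steps.
    return sum(reward_fn(d) for _ in range(steps))

# Hypothetical setting: an uninformative discriminator (D = 0.5 everywhere),
# so the reward carries no information about imitation quality.
d = 0.5
short_steps, long_steps = 10, 100

pos_short = episode_return(reward_positive, d, short_steps)
pos_long = episode_return(reward_positive, d, long_steps)
neg_short = episode_return(reward_negative, d, short_steps)
neg_long = episode_return(reward_negative, d, long_steps)

# Always-positive reward favors surviving longer;
# always-negative reward favors terminating sooner.
print("positive reward prefers long episodes:", pos_long > pos_short)
print("negative reward prefers short episodes:", neg_short > neg_long)
```

Under these assumptions, the first form acts like a survival bonus and the second like a per-step penalty, which is one way a reward can suit some environments and produce sub-optimal behavior in others.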