While combining imitation learning (IL) and reinforcement learning (RL) is a
promising way to address poor sample efficiency in autonomous behavior
acquisition, methods that do so typically assume that the requisite behavior
demonstrations are provided by an expert that behaves optimally with respect to
a task reward. If, however, the demonstrations are suboptimal, a
fundamental challenge arises: the demonstration-matching objective of
IL conflicts with the return-maximization objective of RL. This paper
introduces D-Shape, a new method for combining IL and RL that uses ideas from
reward shaping and goal-conditioned RL to resolve the above conflict. D-Shape
allows learning from suboptimal demonstrations while retaining the ability to
find the optimal policy with respect to the task reward. We experimentally
validate D-Shape in sparse-reward gridworld domains, showing that it both
improves over RL in terms of sample efficiency and converges consistently to
the optimal policy in the presence of suboptimal demonstrations.

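This paper introduces D-Shape, a new method for combining imitation learning and reinforcement learning. It uses reward shaping and goal-conditioned reinforcement learning to resolve the conflict between the imitation learning objective and the reinforcement learning objective, making it possible to learn from suboptimal demonstrations while still retaining the optimal policy with respect to the task reward. We run experiments in sparse-reward gridworld domains and show that D-Shape improves sample efficiency and, even when given suboptimal demonstrations, consistently converges to the optimal policy.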