We study reinforcement learning (RL) with no-reward demonstrations, a setting
in which an RL agent has access to additional data from the interaction of
other agents with the same environment. However, it has no access to the
rewards or goals of these agents, and their objectives and levels of expertise
may vary widely. These assumptions are common in multi-agent settings, such as
autonomous driving. To use this data effectively, we turn to the framework of
successor features, which lets us disentangle the shared features and dynamics
of the environment from agent-specific rewards and policies. We propose a
multi-task inverse reinforcement learning (IRL) algorithm, called \emph{inverse
temporal difference learning} (ITD), that learns shared state features,
alongside per-agent successor features and preference vectors, purely from
demonstrations without reward labels. We further show how to seamlessly
integrate ITD with learning from online environment interactions, arriving at a
novel algorithm for reinforcement learning with demonstrations, called $\Psi
\Phi$-learning (pronounced `Sci-Fi'). We provide empirical evidence for the
effectiveness of $\Psi \Phi$-learning as a method for improving RL, IRL,
imitation, and few-shot transfer, and derive worst-case bounds for its
performance in zero-shot transfer to new tasks.
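
For orientation, the decomposition behind successor features can be sketched in its standard form. The notation below is an assumption for illustration and is not taken from the text above: $\phi$ denotes the shared state-action features, $\mathbf{w}^{k}$ the preference vector of agent $k$, $\pi_k$ its policy, $\psi^{\pi_k}$ its successor features, and $\gamma \in [0, 1)$ a discount factor. Assuming rewards are linear in the shared features, the action-value function factorises into a reward-independent part and an agent-specific part:
\begin{align*}
r^{k}(s, a) &= \phi(s, a)^{\top} \mathbf{w}^{k}, \\
\psi^{\pi_k}(s, a) &= \mathbb{E}_{\pi_k}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} \phi(s_t, a_t) \,\middle|\, s_0 = s,\ a_0 = a \right], \\
Q^{\pi_k}(s, a) &= \mathbb{E}_{\pi_k}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} r^{k}(s_t, a_t) \,\middle|\, s_0 = s,\ a_0 = a \right] = \psi^{\pi_k}(s, a)^{\top} \mathbf{w}^{k}.
\end{align*}
Under this assumption, $\phi$ captures what is shared across agents, while $\psi^{\pi_k}$ and $\mathbf{w}^{k}$ carry the per-agent policy and reward information, respectively.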

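This work investigates how to use demonstrations without reward labels for reinforcement learning. It proposes a multi-task inverse reinforcement learning algorithm based on successor features, establishes the algorithm's effectiveness and a performance bound for zero-shot transfer, and demonstrates its use in several application areas.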