One of the challenges in applying reinforcement learning in a complex
real-world environment lies in providing the agent with a sufficiently detailed
reward function. Any misalignment between the reward and the desired behavior
can result in unwanted outcomes such as "reward hacking," where the agent
maximizes reward through unintended behavior. In this work, we propose to
disentangle the reward into two distinct parts: a simple task-specific reward,
outlining the particulars of the task at hand, and an unknown common-sense
reward, indicating the expected behavior of the agent within the environment.
We then explore how this common-sense reward can be
learned from expert demonstrations. We first show that inverse reinforcement
learning, even when it succeeds in training an agent, does not learn a useful
reward function. That is, training a new agent with the learned reward does not
produce the desired behaviors. We then demonstrate that this problem can be
solved by training simultaneously on multiple tasks. That is, multi-task
inverse reinforcement learning can be applied to learn a useful reward
function.
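
The two-part reward described above can be illustrated with a minimal sketch. It assumes the two parts are combined additively, which the abstract does not state, and it uses a hand-written stand-in for the common-sense reward that would in practice be learned from demonstrations. All names (`goal_reward`, `common_sense_reward`, `combine`, the gridworld cells) are hypothetical.

```python
from typing import Callable, Dict, Tuple

State = Tuple[int, int]          # hypothetical gridworld cell (row, col)
RewardFn = Callable[[State, int], float]


def goal_reward(goal: State) -> RewardFn:
    """Simple task-specific reward: 1.0 on the goal cell, 0.0 elsewhere."""
    def reward(state: State, action: int) -> float:
        return 1.0 if state == goal else 0.0
    return reward


# Stand-in for the common-sense reward. In the setting described above it is
# unknown and would be learned from demonstrations; here it is hand-written
# (assumption) and penalizes entering a hazardous cell.
HAZARDS = {(1, 1)}


def common_sense_reward(state: State, action: int) -> float:
    return -5.0 if state in HAZARDS else 0.0


def combine(task: RewardFn, shared: RewardFn) -> RewardFn:
    """Additive combination (assumption): total = task-specific + common-sense."""
    def total(state: State, action: int) -> float:
        return task(state, action) + shared(state, action)
    return total


# Two hypothetical tasks that differ only in their goal; the common-sense
# term is the same function in both.
tasks: Dict[str, RewardFn] = {
    "reach_corner": goal_reward((2, 2)),
    "reach_edge": goal_reward((0, 2)),
}
total_rewards = {name: combine(task, common_sense_reward)
                 for name, task in tasks.items()}
```

Because the common-sense term is shared, only the task-specific part changes from task to task, which is what makes learning it across several tasks at once plausible.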

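By decomposing the reward into two distinct parts, a task-specific reward and a common-sense reward, and exploring how the latter can be learned from expert demonstrations, the authors address the problems caused by inaccurate reward functions when applying reinforcement learning in complex real-world environments, and show that multi-task inverse reinforcement learning can learn a useful reward function.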

Multi Task Inverse Reinforcement Learning for Common Sense Reward

We study reinforcement learning (RL) with no-reward demonstrations, a setting
in which an RL agent has access to additional data from the interaction of
other agents with the same environment. However, it has no access to the
rewards or goals of these agents, and their objectives and levels of expertise
may vary widely. These assumptions are common in multi-agent settings, such as
autonomous driving. To effectively use this data, we turn to the framework of
successor features. This allows us to disentangle shared features and dynamics
of the environment from agent-specific rewards and policies. We propose a
multi-task inverse reinforcement learning (IRL) algorithm, called \emph{inverse
temporal difference learning} (ITD), that learns shared state features,
alongside per-agent successor features and preference vectors, purely from
demonstrations without reward labels. We further show how to seamlessly
integrate ITD with learning from online environment interactions, arriving at a
novel algorithm for reinforcement learning with demonstrations, called $\Psi
\Phi$-learning (pronounced `Sci-Fi'). We provide empirical evidence for the
effectiveness of $\Psi \Phi$-learning as a method for improving RL, IRL,
imitation, and few-shot transfer, and derive worst-case bounds for its
performance in zero-shot transfer to new tasks.
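
The separation of shared features from agent-specific preferences can be illustrated with a minimal sketch. It is not ITD: it only checks the standard linear identity behind successor features, assuming each step's reward is the dot product of a shared feature vector with an agent's preference vector. The trajectory, the preference vectors `w_agent_a` and `w_agent_b`, and the discount `gamma` are made-up values.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def successor_features(features: List[List[float]], gamma: float) -> List[float]:
    """Discounted sum of feature vectors along one trajectory:
    psi = sum_t gamma**t * phi_t (a single-trajectory estimate)."""
    psi = [0.0] * len(features[0])
    for t, phi in enumerate(features):
        for i, x in enumerate(phi):
            psi[i] += (gamma ** t) * x
    return psi


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


# Shared state features observed along one hypothetical trajectory.
phi_traj = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
gamma = 0.5

# Per-agent preference vectors (assumed values); reward_t = phi_t . w.
w_agent_a = [1.0, 0.0]
w_agent_b = [0.5, 2.0]

psi = successor_features(phi_traj, gamma)

# The same psi gives each agent's return through a single dot product.
return_a = dot(psi, w_agent_a)
return_b = dot(psi, w_agent_b)

# Cross-check against summing each agent's per-step rewards directly.
direct_a = discounted_return([dot(phi, w_agent_a) for phi in phi_traj], gamma)
direct_b = discounted_return([dot(phi, w_agent_b) for phi in phi_traj], gamma)
```

The point of the sketch is that `psi` depends only on the features and the behavior that generated the trajectory, while everything specific to an agent's objective sits in its preference vector.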

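This study explores how to use demonstrations without reward labels for reinforcement learning. It proposes a multi-task inverse reinforcement learning algorithm based on successor features, derives bounds on its performance in zero-shot transfer, and demonstrates its effectiveness in several application settings.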

PsiPhi-Learning: Reinforcement Learning with Demonstrations using Successor Features and Inverse Temporal Difference Learning

Multi-task Inverse Reinforcement Learning (IRL) is the problem of inferring
multiple reward functions from expert demonstrations. Prior work, built on
Bayesian IRL, is unable to scale to complex environments due to computational
constraints. This paper contributes a formulation of multi-task IRL in the more
computationally efficient Maximum Causal Entropy (MCE) IRL framework.
Experiments show our approach can perform one-shot imitation learning in a
gridworld environment that single-task IRL algorithms need hundreds of
demonstrations to solve. We outline preliminary work using meta-learning to
extend our method to the function approximator setting of modern MCE IRL
algorithms. Evaluating on multi-task variants of common simulated robotics
benchmarks, we discover serious limitations of these IRL algorithms, and
conclude with suggestions for further work.
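
For readers unfamiliar with the MCE IRL framework named above, the sketch below shows only its forward step: soft value iteration, which turns a given reward into a stochastic maximum-entropy policy. It is not the paper's multi-task formulation and does not perform reward inference. The three-state chain, its deterministic transitions, the reward values, and the discounted infinite-horizon setting are all assumptions made for illustration.

```python
import math
from typing import List, Tuple


def logsumexp(xs: List[float]) -> float:
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def soft_value_iteration(
    next_state: List[List[int]],   # next_state[s][a] -> s' (deterministic, assumption)
    reward: List[float],           # reward[s], state-based reward (assumption)
    gamma: float = 0.9,
    iters: int = 200,
) -> Tuple[List[float], List[List[float]]]:
    n_states = len(next_state)
    V = [0.0] * n_states
    Q: List[List[float]] = []
    for _ in range(iters):
        # Soft Bellman backup: Q(s,a) = r(s) + gamma * V(s'),
        # V(s) = log sum_a exp Q(s,a).
        Q = [[reward[s] + gamma * V[s2] for s2 in next_state[s]]
             for s in range(n_states)]
        V = [logsumexp(q) for q in Q]
    # Maximum-entropy policy: pi(a|s) = exp(Q(s,a) - V(s)).
    policy = [[math.exp(q - V[s]) for q in Q[s]] for s in range(n_states)]
    return V, policy


# Hypothetical 3-state chain; action 0 moves left, action 1 moves right,
# and only the rightmost state is rewarded.
next_state = [[0, 1], [0, 2], [1, 2]]
reward = [0.0, 0.0, 1.0]
V, policy = soft_value_iteration(next_state, reward)
```

In an MCE IRL loop, a policy of this kind would be recomputed for each candidate reward and compared against the demonstrations; that outer loop is omitted here.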

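This paper presents a formulation of multi-task IRL in the Maximum Causal Entropy IRL framework. The approach performs one-shot imitation learning in a gridworld environment that single-task IRL algorithms need hundreds of demonstrations to solve. Evaluating these IRL algorithms on multi-task variants of common simulated robotics benchmarks reveals serious limitations, and the paper closes with suggestions for further work.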