We generalise the problem of reward modelling (RM) for reinforcement learning
(RL) to handle non-Markovian rewards. Existing work assumes that human
evaluators observe each step in a trajectory independently when providing
feedback on agent behaviour. In this work, we remove this assumption, extending
RM to capture temporal dependencies in human assessment of trajectories. We
show how RM can be approached as a multiple instance learning (MIL) problem,
where trajectories are treated as bags with return labels, and steps within the
trajectories are instances with unseen reward labels. We go on to develop new
MIL models that are able to capture the time dependencies in labelled
trajectories. We demonstrate on a range of RL tasks that our novel MIL models
can reconstruct reward functions to a high level of accuracy, and can be used
to train high-performing agent policies.
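The bag-and-instance framing above can be illustrated with a minimal sketch. The toy task and the names `step_rewards` and `make_bag` are hypothetical, chosen only for illustration, and are not the environments or models used in this work. The example assumes a reward that is paid at a "treasure" step only if a "key" step occurred earlier, so the per-step reward depends on history rather than on the current observation alone.

```python
from typing import List, Tuple

def step_rewards(trajectory: List[str]) -> List[float]:
    """Per-step (instance) rewards. In the MIL setting these are hidden from the learner."""
    has_key = False  # hidden state summarising the relevant history
    rewards = []
    for obs in trajectory:
        if obs == "key":
            has_key = True
        # Non-Markovian: "treasure" is rewarded only if "key" was seen earlier.
        rewards.append(1.0 if (obs == "treasure" and has_key) else 0.0)
    return rewards

def make_bag(trajectory: List[str]) -> Tuple[List[str], float]:
    """Build a bag: the steps are the instances, the return is the only visible label."""
    return trajectory, sum(step_rewards(trajectory))

# Two bags containing the same observations in a different order.
bag_a = make_bag(["start", "key", "treasure"])
bag_b = make_bag(["start", "treasure", "key"])
print(bag_a[1], bag_b[1])
```

Both bags contain the same instances, yet their return labels differ. No reward function that scores each observation independently and sums the scores can reproduce both labels, which is why a model of this kind of data needs to track dependencies across time steps.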

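In this paper, we apply reward modelling to reinforcement learning problems with non-Markovian rewards. We remove the assumption made in existing work that feedback is given on each step independently, and extend reward modelling to capture temporal dependencies in human assessment of trajectories. We treat this as a multiple instance learning (MIL) problem, in which trajectories are bags with return labels and the steps within them are instances with unseen reward labels. We also develop new MIL models that capture the temporal dependencies in labelled trajectories, and show on a range of reinforcement learning tasks that these models can reconstruct reward functions to a high level of accuracy and can be used to train high-performing agent policies.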

Interpretable Multiple Instance Learning for Non-Markovian Reward Modelling Based on Trajectory Labels

Non-Markovian Reward Modelling from Trajectory Labels via Interpretable Multiple Instance Learning

We discuss the role of coordination as a direct learning objective in
multi-agent reinforcement learning (MARL) domains. To this end, we present a
novel means of quantifying coordination in multi-agent systems, and discuss the
implications of using such a measure to optimize coordinated agent policies.
This concept has important implications for adversary-aware RL, which we take
to be a sub-domain of multi-agent learning.

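This paper studies the role of coordination in multi-agent reinforcement learning and proposes a new method for quantitatively measuring coordination in multi-agent systems. It further discusses the importance of using such a measure to optimize coordinated agent policies, and its implications for adversary-aware reinforcement learning.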