Current approaches to learning cooperative behaviors in multi-agent settings
assume relatively restrictive settings. In standard fully cooperative
multi-agent reinforcement learning, the learning algorithm controls
\textit{all} agents in the scenario, while in ad hoc teamwork, the learning
algorithm usually assumes control over only a \textit{single} agent in the
scenario. However, many cooperative settings in the real world are much less
restrictive. For example, in an autonomous driving scenario, a company might
train its cars with the same learning algorithm, yet once on the road, these
cars must cooperate with cars from another company. Towards generalizing the
class of scenarios that cooperative learning methods can address, we introduce
$N$-agent ad hoc teamwork (NAHT), in which a set of autonomous agents must
interact and cooperate with dynamically varying numbers and types of teammates
at evaluation time. This paper formalizes the problem and proposes the
\textit{Policy Optimization with Agent Modelling} (POAM) algorithm. POAM is a
policy gradient, multi-agent reinforcement learning approach to the NAHT
problem that enables adaptation to diverse teammate behaviors by learning
representations of those behaviors. Empirical evaluation on StarCraft II
tasks shows that POAM improves cooperative task returns compared to baseline
approaches, and enables out-of-distribution generalization to unseen teammates.

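Existing methods for learning cooperative behaviors in multi-agent environments typically assume relatively restrictive settings: in fully cooperative multi-agent reinforcement learning, the learning algorithm controls all agents in the scenario, whereas in ad hoc teamwork, the learning algorithm usually controls only a single agent. However, many real-world cooperative scenarios call for more flexible learning methods. This paper proposes the POAM algorithm for $N$-agent ad hoc teamwork, the setting in which agents must interact and cooperate with dynamically varying and diverse types of teammates at evaluation time; POAM adapts to varied teammate behaviors by learning representations of those behaviors. In empirical evaluation on StarCraft II tasks, POAM improves cooperative task returns relative to baseline methods and achieves out-of-distribution generalization to unseen teammates.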