A fundamental challenge in multiagent reinforcement learning is to learn
beneficial behaviors in a shared environment with other simultaneously learning
agents. In particular, each agent perceives the environment as effectively
non-stationary due to the changing policies of other agents. Moreover, each
agent is itself constantly learning, leading to natural non-stationarity in the
distribution of experiences encountered. In this paper, we propose a novel
meta-multiagent policy gradient theorem that directly accounts for the
non-stationary policy dynamics inherent to multiagent learning settings. This
is achieved by modeling our gradient updates to consider both an agent's own
non-stationary policy dynamics and the non-stationary policy dynamics of other
agents in the environment. We show that our theoretically grounded approach
provides a general solution to the multiagent learning problem, which
inherently comprises all key aspects of previous state of the art approaches on
this topic. We test our method on a diverse suite of multiagent benchmarks and
demonstrate a more efficient ability to adapt to new agents as they learn than
baseline methods across the full spectrum of mixed incentive, competitive, and
cooperative domains.

本研究提出了一种新的元多智能体策略梯度定理，该定理直接考虑到多智能体学习环境中固有的非稳态策略动态，并通过建模梯度更新以考虑智能体自身的非稳态策略动态以及环境中其他代理的非稳态策略动态来达成。在多种多智能体基准测试中，我们的方法能够在全谱的混合激励、竞争和合作领域中比基线方法更有效地适应学习新的代理。