To achieve general intelligence, agents must learn how to interact with
others in a shared environment: this is the challenge of multiagent
reinforcement learning (MARL). The simplest form is independent reinforcement
learning (InRL), where each agent treats its experience as part of its
(non-stationary) environment. In this paper, we first observe that policies
learned using InRL can overfit to the other agents' policies during training,
failing to sufficiently generalize during execution. We introduce a new metric,
joint-policy correlation, to quantify this effect. We describe an algorithm for
general MARL, based on approximate best responses to mixtures of policies
generated using deep reinforcement learning, and empirical game-theoretic
analysis to compute meta-strategies for policy selection. The algorithm
generalizes previous ones such as InRL, iterated best response, double oracle,
and fictitious play. Then, we present a scalable implementation that reduces
the memory requirement using decoupled meta-solvers. Finally, we demonstrate
the generality of the resulting policies in two partially observable settings:
gridworld coordination games and poker.

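Summary: This paper proposes an algorithm that combines approximate best responses to mixtures of policies, generated using deep reinforcement learning, with empirical game-theoretic analysis. It addresses the tendency of independent reinforcement learning to overfit to the other agents' policies in multiagent reinforcement learning, and it achieves good results in partially observable environments such as gridworld coordination games and poker.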
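
The loop described above (compute a meta-strategy over the current policies, add a best response to the opponent's mixture, repeat) can be illustrated on a toy matrix game. The following is a minimal sketch under strong assumptions, not the paper's implementation: the game is rock-paper-scissors, an exact argmax best response stands in for the deep reinforcement learning oracle, fictitious play is used as one possible meta-solver, and the names `PAYOFF`, `fictitious_play`, and `policy_space_loop` are hypothetical.

```python
# Toy sketch of a double-oracle-style loop on rock-paper-scissors.
# Assumptions: exact best responses replace the deep-RL oracle, and
# fictitious play is the meta-solver. All names here are hypothetical.

# Row player's payoff; the column player receives the negation (zero-sum).
PAYOFF = [
    [0, -1, 1],   # rock     vs rock, paper, scissors
    [1, 0, -1],   # paper
    [-1, 1, 0],   # scissors
]


def fictitious_play(game, iterations=2000):
    """Approximate meta-strategies of a zero-sum matrix game."""
    n_rows, n_cols = len(game), len(game[0])
    row_counts = [0] * n_rows
    col_counts = [0] * n_cols
    row_counts[0] = 1  # arbitrary initial play
    col_counts[0] = 1
    for _ in range(iterations):
        # Each side best-responds to the opponent's empirical play counts.
        row_values = [sum(game[i][j] * col_counts[j] for j in range(n_cols))
                      for i in range(n_rows)]
        col_values = [-sum(game[i][j] * row_counts[i] for i in range(n_rows))
                      for j in range(n_cols)]
        row_counts[row_values.index(max(row_values))] += 1
        col_counts[col_values.index(max(col_values))] += 1
    row_total, col_total = sum(row_counts), sum(col_counts)
    return ([c / row_total for c in row_counts],
            [c / col_total for c in col_counts])


def empirical_game(payoff, row_pop, col_pop):
    """Payoff table restricted to the policies found so far."""
    return [[payoff[r][c] for c in col_pop] for r in row_pop]


def policy_space_loop(payoff, max_epochs=10):
    """Grow each player's policy set with best responses to meta-strategies."""
    n_actions = len(payoff)
    row_pop, col_pop = [0], [0]  # both players start with "rock" only
    for _ in range(max_epochs):
        row_meta, col_meta = fictitious_play(
            empirical_game(payoff, row_pop, col_pop))
        # Oracle step: best response to the opponent's current mixture.
        row_values = [sum(p * payoff[a][c] for p, c in zip(col_meta, col_pop))
                      for a in range(n_actions)]
        col_values = [-sum(p * payoff[r][a] for p, r in zip(row_meta, row_pop))
                      for a in range(n_actions)]
        row_br = row_values.index(max(row_values))
        col_br = col_values.index(max(col_values))
        grew = False
        if row_br not in row_pop:
            row_pop.append(row_br)
            grew = True
        if col_br not in col_pop:
            col_pop.append(col_br)
            grew = True
        if not grew:  # no new policy for either player: stop
            break
    row_meta, col_meta = fictitious_play(
        empirical_game(payoff, row_pop, col_pop))
    return row_pop, col_pop, row_meta, col_meta


if __name__ == "__main__":
    print(policy_space_loop(PAYOFF))
```

In this sketch, changing the meta-solver changes which classical scheme the loop resembles: a meta-strategy that puts all weight on the most recently added policy gives iterated best response, while a uniform meta-strategy gives a fictitious-play-like procedure. In the setting the abstract describes, the argmax oracle would be replaced by an approximate best response trained with deep reinforcement learning.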