This paper considers the challenging tasks of Multi-Agent Reinforcement Learning (MARL) under partial observability, where each agent only sees her own individual observations and actions that reveal incomplete information about the underlying state of system. This paper studies these tasks under the general model of multiplayer general-sum Partially Observable Markov Games (POMGs), which is significantly larger than the standard model of Imperfect Information Extensive-Form Games (IIEFGs). We identify a rich subclass of POMGs -- weakly revealing POMGs -- in which sample-efficient learning is tractable. In the self-play setting, we prove that a simple algorithm combining optimism and Maximum Likelihood Estimation (MLE) is sufficient to find approximate Nash equilibria, correlated equilibria, as well as coarse correlated equilibria of weakly revealing POMGs, in a polynomial number of samples when the number of agents is small. In the setting of playing against adversarial opponents, we show that a variant of our optimistic MLE algorithm is capable of achieving sublinear regret when being compared against the optimal maximin policies. To our best knowledge, this work provides the first line of sample-efficient results for learning POMGs.

本文研究了多智能体强化学习在部分可观察性下的挑战性任务，其中每个智能体只能看到自己的观察和动作。我们通过考虑广义模型的部分可观察马尔科夫博弈，证明了一个富裕的子类可以使用样本高效的学习方法，从而找到弱显式部分可观察马尔科夫博弈的近似纳什均衡、相关均衡以及粗略相关均衡，当代理数量很小时可在多项式样本复杂度内学得。

部分可观马尔可夫博弈中高效学习的样本有效强化学习