In a single-agent setting, reinforcement learning (RL) tasks can be cast as an inference problem by introducing a binary random variable o that stands for "optimality". In this paper, we redefine the binary random variable o in the multi-agent setting and formalize multi-agent reinforcement learning (MARL) as probabilistic inference. We derive a variational lower bound on the likelihood of achieving optimality and name it the Regularized Opponent Model with Maximum Entropy Objective (ROMMEO). From ROMMEO, we present a novel perspective on opponent modeling and show, theoretically and empirically, how it can improve the performance of training agents in cooperative games. To optimize ROMMEO, we first introduce a tabular Q-iteration method, ROMMEO-Q, with a proof of convergence. We then extend this exact algorithm to complex environments by proposing an approximate version, ROMMEO-AC. We evaluate the two algorithms on a challenging iterated matrix game and a differential game, respectively, and show that they can outperform strong MARL baselines.
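
For orientation, the single-agent construction mentioned in the first sentence is commonly written as below. This is only a sketch of the standard control-as-inference formulation, not the multi-agent derivation of this paper. The symbols $s_t$, $a_t$, $r$, $\pi$, $q$, the horizon $T$, and the uniform action prior are assumed notation that does not appear in the abstract.

```latex
% Assumed notation: state s_t, action a_t, reward r, policy \pi, horizon T.
% Optimality variable: an action is more likely "optimal" when its reward is high.
p(o_t = 1 \mid s_t, a_t) \propto \exp\big(r(s_t, a_t)\big)

% Variational lower bound with q(\tau) = p(s_1) \prod_t p(s_{t+1} \mid s_t, a_t)\, \pi(a_t \mid s_t)
% and a uniform action prior (absorbed into the constant):
\log p(o_{1:T} = 1) \;\ge\;
  \mathbb{E}_{\tau \sim q}\Big[ \sum_{t=1}^{T} \big( r(s_t, a_t) - \log \pi(a_t \mid s_t) \big) \Big]
  + \text{const}
```

Under these assumptions, the $-\log \pi(a_t \mid s_t)$ term is the policy entropy bonus, so maximizing the bound recovers a maximum entropy RL objective.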

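In this work, we redefine the binary random variable $o$ in the multi-agent setting and formalize multi-agent reinforcement learning as probabilistic inference. We propose a variational lower bound named ROMMEO, the Regularized Opponent Model with Maximum Entropy Objective, and from it present a new approach to opponent modeling, showing theoretically and empirically that it can improve the performance of training agents in cooperative games. We introduce a tabular Q-iteration method named ROMMEO-Q and extend it to an approximate version for complex environments, ROMMEO-AC. We evaluate the two algorithms on challenging iterated matrix games and differential games, and show that they can outperform strong multi-agent reinforcement learning baselines.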

Regularized Opponent Model with Maximum Entropy Objective