Multiagent reinforcement learning (MARL) has benefited significantly from
population-based and game-theoretic training regimes. One approach,
Policy-Space Response Oracles (PSRO), employs standard reinforcement learning
to compute response policies via approximate best responses and combines them
via meta-strategy selection. We augment PSRO by adding a novel search procedure
with generative sampling of world states, and introduce two new meta-strategy
solvers based on the Nash bargaining solution. We evaluate PSRO's ability to
compute approximate Nash equilibria, and its performance in two negotiation
games: Colored Trails and Deal or No Deal. We conduct behavioral studies in which
human participants negotiate with our agents ($N = 346$). We find that search
with generative modeling finds stronger policies at both training time and
test time, enables online Bayesian co-player prediction, and can produce agents
that, when negotiating with humans, achieve social welfare comparable to that
of humans trading among themselves.
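
To make the PSRO loop described above concrete, the following is a minimal sketch, not the method evaluated in this work. It assumes a toy zero-sum game (rock-paper-scissors) in place of the negotiation games, an exact best-response oracle in place of reinforcement learning, and a uniform meta-strategy in place of the Nash-bargaining-based solvers. All names (`PAYOFF`, `uniform_meta_strategy`, `best_response`, `psro_sketch`) are hypothetical and chosen for illustration only.

```python
import numpy as np

# Row player's payoffs for rock-paper-scissors (assumed toy game).
# Actions: 0 = rock, 1 = paper, 2 = scissors.
PAYOFF = np.array([
    [0, -1, 1],
    [1, 0, -1],
    [-1, 1, 0],
])


def uniform_meta_strategy(population):
    """Meta-strategy solver (assumption): weight every policy equally."""
    n = len(population)
    return np.full(n, 1.0 / n)


def best_response(opponent_mix):
    """Exact oracle (assumption): the action with the highest expected payoff."""
    return int(np.argmax(PAYOFF @ opponent_mix))


def psro_sketch(iterations=3, initial_action=0):
    """Grow a population of pure policies, one best response per iteration."""
    population = [initial_action]
    for _ in range(iterations):
        # Step 1: the meta-strategy assigns a weight to each policy so far.
        sigma = uniform_meta_strategy(population)
        # Step 2: convert the weighted population into a mix over actions.
        opponent_mix = np.zeros(PAYOFF.shape[0])
        for action, weight in zip(population, sigma):
            opponent_mix[action] += weight
        # Step 3: add a best response against that mix to the population.
        population.append(best_response(opponent_mix))
    return population
```

For instance, `psro_sketch(iterations=2)` returns `[0, 1, 1]`: paper is the best response to rock, and again to the even mix of rock and paper. With a uniform meta-strategy the loop reduces to fictitious play; swapping in a different solver in `uniform_meta_strategy` changes which mixture each new response is trained against.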

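This paper presents an augmented version of Policy-Space Response Oracles (PSRO), a training framework for multiagent systems. It extends PSRO with a novel search procedure and generative sampling, and introduces two new meta-strategy solvers based on the Nash bargaining solution. Experiments in negotiation games show that the method can successfully compute approximate Nash equilibria and can produce agents that are comparable to humans in negotiation.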