Policy-Space Response Oracles (PSRO) as a general algorithmic framework has
achieved state-of-the-art performance in learning equilibrium policies of
two-player zero-sum games. However, the hand-crafted hyperparameter value
selection in most of the existing works requires extensive domain knowledge,
forming the main barrier to applying PSRO to different games. In this work, we
make the first attempt to investigate the possibility of self-adaptively
determining the optimal hyperparameter values in the PSRO framework. Our
contributions are three-fold: (1) Using several hyperparameters, we propose a
parametric PSRO that unifies the gradient descent ascent (GDA) and different
PSRO variants. (2) We propose the self-adaptive PSRO (SPSRO) by casting the
hyperparameter value selection of the parametric PSRO as a hyperparameter
optimization (HPO) problem where our objective is to learn an HPO policy that
can self-adaptively determine the optimal hyperparameter values during the
running of the parametric PSRO. (3) To overcome the poor performance of online
HPO methods, we propose a novel offline HPO approach to optimize the HPO policy
based on the Transformer architecture. Experiments on various two-player
zero-sum games demonstrate the superiority of SPSRO over different baselines.

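Using the Transformer architecture, we propose a parametric Policy-Space Response Oracles (PSRO) method with self-adaptive hyperparameter selection, which shows superior performance on a variety of two-player zero-sum games.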

Self-adaptive PSRO: Towards an Automatic Population-based Game Solver

Policy Space Response Oracles (PSRO) is a reinforcement learning (RL)
algorithm for two-player zero-sum games that has been empirically shown to find
approximate Nash equilibria in large games. Although PSRO is guaranteed to
converge to an approximate Nash equilibrium and can handle continuous actions,
it may take an exponential number of iterations as the number of information
states (infostates) grows. We propose Extensive-Form Double Oracle (XDO), an
extensive-form double oracle algorithm for two-player zero-sum games that is
guaranteed to converge to an approximate Nash equilibrium in a number of
iterations linear in the number of infostates. Unlike PSRO, which mixes best
responses at the root of
the game, XDO mixes best responses at every infostate. We also introduce Neural
XDO (NXDO), where the best response is learned through deep RL. In tabular
experiments on Leduc poker, we find that XDO achieves an approximate Nash
equilibrium in a number of iterations an order of magnitude smaller than PSRO.
Experiments on a modified Leduc poker game and Oshi-Zumo show that tabular XDO
achieves a lower exploitability than CFR with the same amount of computation.
We also find that NXDO outperforms PSRO and NFSP on a sequential
multidimensional continuous-action game. NXDO is the first deep RL method that
can find an approximate Nash equilibrium in high-dimensional continuous-action
sequential games. Experiment code is available at
this https URL

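This paper proposes Extensive-Form Double Oracle (XDO) and Neural XDO (NXDO) as alternatives to Policy Space Response Oracles (PSRO) for two-player zero-sum games. Unlike PSRO, XDO is guaranteed to converge to an approximate Nash equilibrium in a number of iterations linear in the number of infostates, which makes it better suited to large games. In experiments, XDO and NXDO perform strongly against the baselines.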

XDO: A Double Oracle Algorithm for Extensive-Form Games

This paper investigates a population-based training regime based on
game-theoretic principles called Policy-Space Response Oracles (PSRO). PSRO is
general in the sense that it (1) encompasses well-known algorithms such as
fictitious play and double oracle as special cases, and (2) in principle
applies to general-sum, many-player games. Despite this, prior studies of PSRO
have been focused on two-player zero-sum games, a regime wherein Nash
equilibria are tractably computable. In moving from two-player zero-sum games
to more general settings, computation of Nash equilibria quickly becomes
infeasible. Here, we extend the theoretical underpinnings of PSRO by
considering an alternative solution concept, $\alpha$-Rank, which is unique
(thus faces no equilibrium selection issues, unlike Nash) and applies readily
to general-sum, many-player settings. We establish convergence guarantees in
several game classes, and identify links between Nash equilibria and
$\alpha$-Rank. We demonstrate the competitive performance of
$\alpha$-Rank-based PSRO against an exact Nash solver-based PSRO in 2-player
Kuhn and Leduc Poker. We then go beyond the reach of prior PSRO applications by
considering 3- to 5-player poker games, yielding instances where $\alpha$-Rank
achieves faster convergence than approximate Nash solvers, thus establishing it
as a favorable general games solver. We also carry out an initial empirical
validation in MuJoCo soccer, illustrating the feasibility of the proposed
approach in another complex domain.
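
The claim that PSRO contains fictitious play and double oracle as special
cases can be illustrated with a small sketch. The code below is not taken from
the paper: it assumes a toy matrix game (Rock-Paper-Scissors), exact
best-response oracles, and two hypothetical meta-solvers named
`uniform_meta_solver` and `approx_nash_meta_solver`. The second one uses an
inner fictitious-play loop as a stand-in for an exact Nash solver. Swapping the
meta-solver is the only change needed to move between the two special cases.

```python
from typing import Callable, List, Tuple

Matrix = List[List[float]]
Mix = List[float]
MetaSolver = Callable[[Matrix], Tuple[Mix, Mix]]

# Row player's payoffs in Rock-Paper-Scissors (illustrative game, not from the
# paper); the column player receives the negative of each entry.
RPS: Matrix = [[0.0, -1.0, 1.0], [1.0, 0.0, -1.0], [-1.0, 1.0, 0.0]]


def uniform_meta_solver(sub: Matrix) -> Tuple[Mix, Mix]:
    """Uniform mixture over each population: the fictitious-play-like case."""
    n, m = len(sub), len(sub[0])
    return [1.0 / n] * n, [1.0 / m] * m


def approx_nash_meta_solver(sub: Matrix, steps: int = 500) -> Tuple[Mix, Mix]:
    """Approximate Nash of the restricted game: the double-oracle-like case.

    Assumption: an inner fictitious-play loop stands in for an exact LP solver.
    """
    n, m = len(sub), len(sub[0])
    row_counts = [1] + [0] * (n - 1)
    col_counts = [1] + [0] * (m - 1)
    for _ in range(steps):
        r = max(range(n), key=lambda i: sum(col_counts[j] * sub[i][j] for j in range(m)))
        c = min(range(m), key=lambda j: sum(row_counts[i] * sub[i][j] for i in range(n)))
        row_counts[r] += 1
        col_counts[c] += 1
    total = steps + 1
    return [x / total for x in row_counts], [x / total for x in col_counts]


def to_full(mix: Mix, population: List[int], size: int) -> Mix:
    """Project a mixture over population members onto the full action set."""
    full = [0.0] * size
    for prob, action in zip(mix, population):
        full[action] += prob
    return full


def row_values(payoff: Matrix, col: Mix) -> List[float]:
    """Expected row-player payoff of each pure row action against `col`."""
    return [sum(q * v for q, v in zip(col, row)) for row in payoff]


def col_values(payoff: Matrix, row: Mix) -> List[float]:
    """Expected row-player payoff when the column player picks each pure action."""
    return [sum(p * payoff[r][c] for r, p in enumerate(row)) for c in range(len(payoff[0]))]


def exploitability(payoff: Matrix, row: Mix, col: Mix) -> float:
    """Total gain both players could get by best-responding (0 at a Nash equilibrium)."""
    return max(row_values(payoff, col)) - min(col_values(payoff, row))


def psro(payoff: Matrix, meta_solver: MetaSolver, iterations: int) -> Tuple[Mix, Mix]:
    """Grow both populations with best responses to the meta-solver's mixture."""
    row_pop, col_pop = [0], [0]  # start each population from one arbitrary action
    n_rows, n_cols = len(payoff), len(payoff[0])
    for _ in range(iterations):
        sub = [[payoff[r][c] for c in col_pop] for r in row_pop]
        row_mix, col_mix = meta_solver(sub)
        rv = row_values(payoff, to_full(col_mix, col_pop, n_cols))
        cv = col_values(payoff, to_full(row_mix, row_pop, n_rows))
        # Oracle step: exact best responses (an RL oracle would approximate these).
        row_pop.append(max(range(n_rows), key=rv.__getitem__))
        col_pop.append(min(range(n_cols), key=cv.__getitem__))
    sub = [[payoff[r][c] for c in col_pop] for r in row_pop]
    row_mix, col_mix = meta_solver(sub)
    return to_full(row_mix, row_pop, n_rows), to_full(col_mix, col_pop, n_cols)


fp_row, fp_col = psro(RPS, uniform_meta_solver, iterations=30)
do_row, do_col = psro(RPS, approx_nash_meta_solver, iterations=6)
print("uniform meta-solver exploitability:", exploitability(RPS, fp_row, fp_col))
print("approx-Nash meta-solver exploitability:", exploitability(RPS, do_row, do_col))
```

In this sketch the populations may contain repeated actions, which is what
makes the uniform mixture behave like the empirical average used in fictitious
play. Replacing the Nash-style meta-solver with an $\alpha$-Rank solver, as the
abstract proposes, would change only the `meta_solver` argument; that solver is
not implemented here.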

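Grounded in game-theoretic principles, this paper studies a population-based training regime, Policy-Space Response Oracles (PSRO), and extends it to general-sum, many-player games. Using an alternative solution concept, $\alpha$-Rank, it establishes convergence guarantees in several game classes and identifies links between Nash equilibria and $\alpha$-Rank. Experiments show that $\alpha$-Rank-based PSRO can, in some instances, converge faster than approximate Nash solvers.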