For solving zero-sum games involving non-transitivity, a common approach is
to maintain a population of policies to approximate the Nash Equilibrium (NE).
Previous research has shown that the Policy Space Response Oracle (PSRO) is an
effective multi-agent reinforcement learning framework for these games.
However, repeatedly training new policies from scratch to approximate the Best
Response (BR) to opponents' mixed policies at each iteration is inefficient and
costly. While some PSRO methods initialize a new BR policy by inheriting from
past BR policies, this approach limits the exploration of new policies,
especially against challenging opponents. To address this issue, we propose
Fusion-PSRO, which uses model fusion to initialize the policy for better
approximation to BR. With Top-k probabilities from NE, we select high-quality
base policies and fuse them into a new BR policy through model averaging. This
approach allows the initialized policy to incorporate multiple expert policies,
making it easier to handle difficult opponents compared to inheriting or
initializing from scratch. Additionally, our method only modifies the policy
initialization, enabling its application to nearly all PSRO variants without
additional training overhead. Our experiments with non-transitive matrix games,
Leduc poker, and the more complex Liar's Dice demonstrate that Fusion-PSRO
enhances the performance of nearly all PSRO variants, achieving lower
exploitability.

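To solve non-transitive zero-sum games, this work proposes Fusion-PSRO, which initializes policies through model fusion to better approximate the best response, and validates its effectiveness at improving nearly all PSRO variants in experiments including non-transitive matrix games and the more complex Liar's Dice.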
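
The fusion step described in the abstract (pick the Top-k policies by NE probability, then average their models) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `fuse_top_k`, the representation of a policy as a dict of NumPy arrays, and the option to weight the average by renormalized NE probabilities are all assumptions made for the example. The abstract only says "model averaging", so a plain uniform average is offered as well.

```python
import numpy as np

def fuse_top_k(policies, nash_probs, k=2, weighted=True):
    """Average the parameters of the k policies with the highest NE probability.

    policies   : list of dicts mapping parameter name -> np.ndarray
                 (hypothetical representation; all policies share shapes)
    nash_probs : meta-strategy probabilities, one per policy
    weighted   : if True, weight by renormalized NE probability (an assumption);
                 if False, use a uniform average over the selected policies
    """
    nash_probs = np.asarray(nash_probs, dtype=float)
    # Indices of the k most probable policies under the meta-strategy.
    top = np.argsort(nash_probs)[::-1][:k]
    if weighted:
        w = nash_probs[top] / nash_probs[top].sum()
    else:
        w = np.full(len(top), 1.0 / len(top))
    # Parameter-wise weighted average of the selected base policies.
    fused = {}
    for name in policies[0]:
        fused[name] = sum(wi * policies[i][name] for wi, i in zip(w, top))
    return fused
```

The returned dict would then serve as the starting point for best-response training in place of a random or inherited initialization, which is why the rest of the PSRO loop can stay unchanged.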

Fusion-PSRO: Nash Policy Fusion for Policy Space Response Oracles

Policy Space Response Oracle methods (PSRO) provide a general solution to
learn Nash equilibrium in two-player zero-sum games but suffer from two
drawbacks: (1) computational inefficiency due to the need for consistent
meta-game evaluation via simulations, and (2) exploration inefficiency due
to finding the best response against a fixed meta-strategy at every epoch. In
this work, we propose Efficient PSRO (EPSRO), which substantially improves the
efficiency of the above two steps. Central to our development is the
newly-introduced subroutine of no-regret optimization on the
unrestricted-restricted (URR) game. By solving URR at each epoch, one can
evaluate the current game and compute the best response in one forward pass
without the need for meta-game simulations. Theoretically, we prove that the
solution procedures of EPSRO offer a monotonic improvement on the
exploitability, which none of the existing PSRO methods possess. Furthermore, we
prove that the no-regret optimization has a regret bound of
$\mathcal{O}(\sqrt{T\log{[(k^2+k)/2]}})$, where $k$ is the size of the restricted
policy set. Most importantly, a desirable property of EPSRO is that it is
parallelizable, which allows for highly efficient exploration in the policy
space that induces behavioral diversity. We test EPSRO on three classes of
games, and report a 50x speedup in wall-clock time and a 10x improvement in
data efficiency while maintaining similar exploitability to existing PSRO methods on Kuhn and Leduc
Poker games.

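This work proposes Efficient PSRO to address the low computational and exploration efficiency of conventional Policy Space Response Oracle methods. By introducing techniques such as no-regret optimization and parallelization, it achieves a 50x speedup and a 10x improvement in data efficiency on Kuhn and Leduc Poker while maintaining comparable exploitability.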
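
To make "no-regret optimization" and "exploitability" concrete, here is a minimal sketch using the textbook Hedge (multiplicative weights) algorithm in self-play on a small zero-sum matrix game. It is not the URR subroutine from the abstract, whose details are not given here. The function names, the step size, and the choice to define exploitability as the sum of both players' best-response gains (some papers halve this quantity) are assumptions for illustration. Hedge over $N$ actions has regret of order $\sqrt{T\log N}$, the same form as the bound quoted above with $N=(k^2+k)/2$.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    e = np.exp(z - z.max())
    return e / e.sum()

def exploitability(A, x, y):
    """Sum of both players' best-response gains; A is the row player's payoff."""
    return float(np.max(A @ y) - np.min(x @ A))

def hedge_self_play(A, T=5000, eta=None):
    """Run Hedge for both players; return their time-averaged strategies."""
    m, n = A.shape
    if eta is None:
        eta = np.sqrt(np.log(max(m, n)) / T)  # illustrative step size
    cum_row = np.zeros(m)  # cumulative payoff of each row action
    cum_col = np.zeros(n)  # cumulative loss of each column action
    x_avg, y_avg = np.zeros(m), np.zeros(n)
    for _ in range(T):
        x = softmax(eta * cum_row)    # row player maximizes payoff
        y = softmax(-eta * cum_col)   # column player minimizes it
        x_avg += x
        y_avg += y
        cum_row += A @ y
        cum_col += x @ A
    return x_avg / T, y_avg / T
```

For example, on a biased rock-paper-scissors matrix such as `A = np.array([[0., -1., 2.], [1., 0., -1.], [-2., 1., 0.]])`, calling `exploitability(A, *hedge_self_play(A))` should give a value well below that of the uniform strategy pair, because the exploitability of the averaged strategies equals the sum of the two players' average regrets.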