Learning to play optimally against any mixture over a diverse set of
strategies is of significant practical interest in competitive games. In this
paper, we propose simplex-NeuPL, which satisfies two desiderata simultaneously:
i) learning a population of strategically diverse basis policies, represented
by a single conditional network; ii) using the same network, learning
best-responses to any mixture over the simplex of basis policies. We show that
the resulting conditional policies incorporate prior information about their
opponents effectively, enabling near-optimal returns against arbitrary mixture
policies in a game with tractable best-responses. We verify that such policies
behave Bayes-optimally under uncertainty and offer insights into using this
flexibility at test time. Finally, we offer evidence that learning
best-responses to any mixture policy is an effective auxiliary task for
strategic exploration, which, by itself, can lead to more performant
populations.
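
To make "a best-response to a mixture over the simplex" concrete, the following is a minimal sketch in a game where best-responses are tractable. It is not the paper's method: simplex-NeuPL learns a conditional neural policy, whereas this example assumes a rock-paper-scissors payoff matrix in which the three pure actions stand in for the basis policies. The names `PAYOFF` and `best_response` are hypothetical and chosen only for illustration.

```python
# Row player's payoff in rock-paper-scissors, a symmetric zero-sum game.
# Rows and columns: 0 = rock, 1 = paper, 2 = scissors.
# Assumption: the three pure actions play the role of "basis policies".
PAYOFF = [
    [0.0, -1.0, 1.0],
    [1.0, 0.0, -1.0],
    [-1.0, 1.0, 0.0],
]


def best_response(payoff, mixture):
    """Return (action, value) of a pure best response to an opponent mixture.

    `mixture` is a point on the probability simplex over the opponent's
    basis policies; it can also be read as a prior over who the opponent is.
    """
    if min(mixture) < 0 or abs(sum(mixture) - 1.0) > 1e-9:
        raise ValueError("mixture must lie on the probability simplex")
    # Expected payoff of each of our actions against the mixture.
    expected = [sum(a * p for a, p in zip(row, mixture)) for row in payoff]
    # A pure action that maximizes expected payoff (first one on ties).
    action = max(range(len(expected)), key=expected.__getitem__)
    return action, expected[action]


# An opponent that favors rock is best answered with paper.
action, value = best_response(PAYOFF, [0.6, 0.3, 0.1])
```

In this toy setting the best response is an argmax over a matrix-vector product. The abstract's claim concerns the harder case where a single network must approximate this mapping for every mixture on the simplex.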

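This paper proposes the Simplex-NeuPL algorithm, which uses a single conditional network to represent a population of strategically diverse basis policies while also learning best-responses. Experimental results show that the algorithm handles uncertainty effectively and delivers better performance at test time. In addition, learning best-responses to arbitrary mixture policies is an effective auxiliary task for strategic exploration and can improve performance.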

Simplex Neural Population Learning: Any-Mixture Bayes-Optimality in Symmetric Zero-sum Games

I study a game of strategic exploration with private payoffs and public
actions in a Bayesian bandit setting. In particular, I look at cascade
equilibria, in which agents switch over time from the risky action to the
riskless action only when they become sufficiently pessimistic. I show that
these equilibria exist under some conditions and establish their salient
properties. Individual exploration in these equilibria can be more or less than
the single-agent level depending on whether the agents start out with a common
prior or not, but the most optimistic agent always underexplores. I also show
that allowing the agents to write enforceable ex-ante contracts leads
the most ex-ante optimistic agent to buy all payoff streams, providing an
explanation for the buyout of smaller start-ups by more established firms.
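
The switching behavior described above (play the risky action until sufficiently pessimistic, then move to the riskless one) can be illustrated with a single-agent sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the Beta-Bernoulli belief model, the fixed `cutoff` parameter (in the paper the threshold would come out of the equilibrium analysis), and the function name `rounds_until_switch`. The sketch does not model the strategic interaction between agents.

```python
def rounds_until_switch(alpha, beta, outcomes, cutoff):
    """Count how many rounds an agent plays the risky action before switching.

    alpha, beta: Beta prior pseudo-counts for success and failure (assumed model).
    outcomes: observed 0/1 payoffs of the risky action, in order.
    cutoff: hypothetical belief threshold; the agent switches to the
        riskless action once its posterior mean falls below it.
    """
    played = 0
    for outcome in outcomes:
        if alpha / (alpha + beta) < cutoff:
            break  # sufficiently pessimistic: switch to the riskless action
        played += 1
        # Bayesian update of the Beta belief after observing the payoff.
        alpha += outcome
        beta += 1 - outcome
    return played


# Same run of bad news, different priors: the more optimistic prior
# keeps the agent on the risky action for longer.
bad_news = [0] * 10
optimist = rounds_until_switch(3, 1, bad_news, cutoff=0.35)
pessimist = rounds_until_switch(1, 1, bad_news, cutoff=0.35)
```

The example only shows how the prior changes the switching time for one agent in isolation. The abstract's comparison with the single-agent level concerns equilibrium behavior when agents also observe each other's actions.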

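This paper studies a game of strategic exploration with private payoffs and public actions, focusing on cascade equilibria, in which agents switch over time from the risky action to the riskless action only when they become sufficiently pessimistic. It shows that these equilibria exist under certain conditions and establishes their salient properties. It also studies the case where agents can write enforceable ex-ante contracts, which offers an explanation for why more established firms buy out the payoff streams of smaller start-ups.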