In multi-agent settings with mixed incentives, methods developed for zero-sum
games have been shown to lead to detrimental outcomes. To address this issue,
opponent shaping (OS) methods explicitly learn to influence the learning
dynamics of co-players and empirically lead to improved individual and
collective outcomes. However, OS methods have only been evaluated in
low-dimensional environments due to the challenges associated with estimating
higher-order derivatives or scaling model-free meta-learning. Alternative
methods that scale to more complex settings either converge to undesirable
solutions or rely on unrealistic assumptions about the environment or
co-players. In this paper, we successfully scale an OS-based approach to
general-sum games with temporally extended actions and long time horizons for
the first time. After analysing the representations of the meta-state and
history used by previous algorithms, we propose a simplified version called
Shaper. We show empirically that Shaper leads to improved individual and
collective outcomes in a range of challenging settings from the literature. We
further formalize a technique previously implicit in the literature, and
analyse its contribution to opponent shaping. We show empirically that this
technique is helpful for the functioning of prior methods in certain
environments. Lastly, we show that previous environments, such as the CoinGame,
are inadequate for analysing temporally extended general-sum interactions.

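In multi-agent settings with mixed incentives, opponent shaping methods learn to influence the learning of co-players. We successfully scale such a method to general-sum games with temporally extended actions and long time horizons, propose a simplified version called Shaper, and show that Shaper improves individual and collective outcomes in a range of challenging environments.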

Scaling Opponent Shaping to High Dimensional Games

Function approximation (FA) has been a critical component in solving large
zero-sum games. Yet, little attention has been given to FA in solving
\textit{general-sum} extensive-form games, even though they are widely regarded
as computationally more challenging than their fully competitive or
cooperative counterparts. A key challenge is that for many equilibria in
general-sum games, no simple analogue to the state value function used in
Markov Decision Processes and zero-sum games exists. In this paper, we propose
learning the \textit{Enforceable Payoff Frontier} (EPF) -- a generalization of
the state value function for general-sum games. We approximate the optimal
\textit{Stackelberg extensive-form correlated equilibrium} by representing EPFs
with neural networks and training them by using appropriate backup operations
and loss functions. This is the first method that applies FA to the Stackelberg
setting, allowing us to scale to much larger games while still enjoying
performance guarantees based on FA error. Additionally, our proposed method
guarantees incentive compatibility and is easy to evaluate without having to
depend on self-play or approximate best-response oracles.
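
As a point of reference for the Stackelberg setting, the following is a minimal sketch of a pure-strategy leader commitment in a small bimatrix game. It is not the EPF method described above: the function name `pure_stackelberg`, the payoff matrices, and the rule that the follower breaks ties in the leader's favour are all assumptions chosen for illustration, and the sketch ignores mixed commitments, correlation, and the extensive-form structure.

```python
def pure_stackelberg(leader, follower):
    """Return (row, col, leader_payoff) for the best pure leader commitment.

    leader[i][j] and follower[i][j] are the payoffs when the leader plays
    row i and the follower plays column j. Illustrative assumption: the
    follower best-responds and breaks ties in the leader's favour.
    """
    best = None
    for i, (lrow, frow) in enumerate(zip(leader, follower)):
        top = max(frow)
        # Follower's best responses to row i; pick the one the leader prefers.
        j = max((c for c, v in enumerate(frow) if v == top),
                key=lambda c: lrow[c])
        if best is None or lrow[j] > best[2]:
            best = (i, j, lrow[j])
    return best


# Hypothetical 2x2 game: row 0 strictly dominates row 1 for the leader,
# yet committing to row 1 steers the follower to column 1.
leader = [[2, 4], [1, 3]]
follower = [[1, 0], [0, 1]]
print(pure_stackelberg(leader, follower))  # (1, 1, 3)
```

Without commitment, the leader would play its dominant row 0 and receive 2. Committing to row 1 yields 3, which is why the value of a state in this setting depends on what the leader can enforce rather than on a single scalar.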

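This work proposes a neural-network-based function approximation method for the Stackelberg setting in general-sum games. It learns the Enforceable Payoff Frontier, which enables approximate computation and evaluation of game strategies.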

Function Approximation for Solving Stackelberg Equilibrium in Large Perfect Information Games

We study the problem of finding optimal correlated equilibria of various
sorts: normal-form coarse correlated equilibrium (NFCCE), extensive-form coarse
correlated equilibrium (EFCCE), and extensive-form correlated equilibrium
(EFCE). This is NP-hard in the general case and has been studied in special
cases, most notably triangle-free games, which include all two-player games
with public chance moves. However, the general case is not well understood, and
algorithms usually scale poorly. First, we introduce the correlation DAG, a
representation of the space of correlated strategies whose size is dependent on
the specific solution concept. It extends the team belief DAG of Zhang et al.
to general-sum games. For each of the three solution concepts, its size depends
exponentially only on a parameter related to the game's information structure.
We also prove a fundamental complexity gap: while our size bounds for NFCCE are
similar to those achieved in the case of team games by Zhang et al., this is
impossible to achieve for the other two concepts under standard complexity
assumptions. Second, we propose a two-sided column generation approach to
compute optimal correlated strategies. Our algorithm improves upon the
one-sided approach of Farina et al. by means of a new decomposition of
correlated strategies which allows players to re-optimize their sequence-form
strategies with respect to correlation plans which were previously added to the
support. Our techniques outperform the prior state of the art for computing
optimal general-sum correlated equilibria. For team games, the two-sided column
generation approach vastly outperforms standard column generation approaches,
making it the state-of-the-art algorithm when the parameter is large. Along the
way, we also introduce two new benchmark games: a trick-taking game that
emulates the endgame phase of the card game bridge, and a ride-sharing game.
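
To make the simplest of the three solution concepts concrete, here is a small sketch that verifies the coarse correlated equilibrium constraints for a two-player normal-form game. It only checks a given joint distribution; it does not compute an optimal one and does not use the correlation DAG or column generation. The function name, the Chicken-style payoff matrices, and the candidate distributions are assumptions made for illustration.

```python
def is_coarse_correlated_eq(u1, u2, mu, tol=1e-9):
    """Check whether the joint distribution mu is a coarse correlated equilibrium.

    u1[a][b], u2[a][b]: payoffs when player 1 plays a and player 2 plays b.
    mu[a][b]: probability that the mediator recommends the profile (a, b).
    A player who deviates ignores the recommendation and plays one fixed
    action, while the other player keeps following mu.
    """
    rows, cols = range(len(u1)), range(len(u1[0]))
    ev1 = sum(mu[a][b] * u1[a][b] for a in rows for b in cols)
    ev2 = sum(mu[a][b] * u2[a][b] for a in rows for b in cols)
    for dev in rows:  # player 1 commits to action `dev` in advance
        if sum(mu[a][b] * u1[dev][b] for a in rows for b in cols) > ev1 + tol:
            return False
    for dev in cols:  # player 2 commits to action `dev` in advance
        if sum(mu[a][b] * u2[a][dev] for a in rows for b in cols) > ev2 + tol:
            return False
    return True


# Hypothetical Chicken-style game; action 0 = Dare, action 1 = Chicken.
U1 = [[0, 7], [2, 6]]
U2 = [[0, 2], [7, 6]]
MIXED = [[0.0, 1 / 3], [1 / 3, 1 / 3]]   # never recommends (Dare, Dare)
ALL_DARE = [[1.0, 0.0], [0.0, 0.0]]      # always recommends (Dare, Dare)

print(is_coarse_correlated_eq(U1, U2, MIXED))     # True
print(is_coarse_correlated_eq(U1, U2, ALL_DARE))  # False
```

Each constraint is linear in `mu`, so in the normal-form case an optimal equilibrium for a linear objective can be found by linear programming over the joint distribution. The difficulty addressed in the abstract above arises because the space of correlated strategies in extensive-form games is far larger.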

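This work studies the problem of computing optimal correlated strategies for several types of correlated equilibrium. It proposes the correlation DAG representation and a two-sided column generation algorithm for computing optimal strategies, analyses their complexity, and introduces new benchmark games.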

Optimal Correlated Equilibria in General-Sum Extensive-Form Games: Fixed-Parameter Algorithms, Hardness, and Two-Sided Column-Generation

Learning in general-sum games is unstable and frequently leads to socially
undesirable (Pareto-dominated) outcomes. To mitigate this, Learning with
Opponent-Learning Awareness (LOLA) introduced opponent shaping to this setting,
by accounting for each agent's influence on their opponents' anticipated
learning steps. However, the original LOLA formulation (and follow-up work) is
inconsistent because LOLA models other agents as naive learners rather than
LOLA agents. In previous work, this inconsistency was suggested as a cause of
LOLA's failure to preserve stable fixed points (SFPs). First, we formalize
consistency and show that higher-order LOLA (HOLA) solves LOLA's inconsistency
problem if it converges. Second, we correct a claim made in the literature by
Schäfer and Anandkumar (2019), proving that Competitive Gradient Descent
(CGD) does not recover HOLA as a series expansion (and fails to solve the
consistency problem). Third, we propose a new method called Consistent LOLA
(COLA), which learns update functions that are consistent under mutual opponent
shaping. It requires no more than second-order derivatives and learns
consistent update functions even when HOLA fails to converge. However, we also
prove that even consistent update functions do not preserve SFPs, contradicting
the hypothesis that this shortcoming is caused by LOLA's inconsistency.
Finally, in an empirical evaluation on a set of general-sum games, we find that
COLA finds prosocial solutions and that it converges under a wider range of
learning rates than HOLA and LOLA. We support the latter finding with a
theoretical result for a simple game.
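
The difference between naive learning and opponent shaping can be sketched on a toy differentiable game. The code below is an illustration under stated assumptions, not the implementation from the paper: the quadratic losses, the learning rate, and the function names are chosen for the example. It uses the look-ahead form of LOLA, in which each player differentiates its loss through the opponent's anticipated naive gradient step, and finite differences stand in for automatic differentiation.

```python
def grad(f, x, y, wrt, eps=1e-4):
    """Central finite-difference partial derivative of f(x, y)."""
    if wrt == "x":
        return (f(x + eps, y) - f(x - eps, y)) / (2 * eps)
    return (f(x, y + eps) - f(x, y - eps)) / (2 * eps)


# Hypothetical two-player game: player 1 controls x, player 2 controls y.
def loss1(x, y):
    return (x + y) ** 2 - 2 * x


def loss2(x, y):
    return (x + y) ** 2 - 2 * y


def naive_step(x, y, lr):
    """Each player descends its own loss, treating the opponent as fixed."""
    return x - lr * grad(loss1, x, y, "x"), y - lr * grad(loss2, x, y, "y")


def lola_step(x, y, lr):
    """Each player differentiates through the opponent's anticipated naive step."""
    def lookahead1(x_, y_):
        return loss1(x_, y_ - lr * grad(loss2, x_, y_, "y"))

    def lookahead2(x_, y_):
        return loss2(x_ - lr * grad(loss1, x_, y_, "x"), y_)

    return (x - lr * grad(lookahead1, x, y, "x"),
            y - lr * grad(lookahead2, x, y, "y"))


def run(step, steps=200, lr=0.1):
    x, y = 0.0, 0.0
    for _ in range(steps):
        x, y = step(x, y, lr)  # simultaneous update from the old (x, y)
    return x, y


print(run(naive_step))  # settles where x + y = 1
print(run(lola_step))   # settles where x + y = 1.3125 for lr = 0.1
```

In this toy game the look-ahead learner models its opponent as a naive learner, which is the inconsistency discussed above, and it settles away from the fixed points of naive learning, at a point where both players incur a higher loss.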

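The authors extend LOLA with a method called Consistent LOLA (COLA), which learns update functions that remain consistent under mutual opponent shaping. In a series of experiments on general-sum games, they find that COLA converges more readily than HOLA and LOLA and finds more prosocial solutions.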