In multi-agent settings with mixed incentives, methods developed for zero-sum
games have been shown to lead to detrimental outcomes. To address this issue,
opponent shaping (OS) methods explicitly learn to influence the learning
dynamics of co-players and empirically lead to improved individual and
collective outcomes. However, OS methods have only been evaluated in
low-dimensional environments due to the challenges associated with estimating
higher-order derivatives or scaling model-free meta-learning. Alternative
methods that scale to more complex settings either converge to undesirable
solutions or rely on unrealistic assumptions about the environment or
co-players. In this paper, we successfully scale an OS-based approach to
general-sum games with temporally-extended actions and long time horizons for
the first time. After analysing the representations of the meta-state and
history used by previous algorithms, we propose a simplified version called
Shaper. We show empirically that Shaper leads to improved individual and
collective outcomes in a range of challenging settings from the literature. We
further formalize a technique previously implicit in the literature, and
analyse its contribution to opponent shaping. We show empirically that this
technique is helpful for the functioning of prior methods in certain
environments. Lastly, we show that previous environments, such as the CoinGame,
are inadequate for analysing temporally-extended general-sum interactions.

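In multi-agent settings with mixed incentives, opponent shaping methods learn to influence the learning of co-players. We scale such a method to general-sum games with temporally-extended actions and long time horizons, propose a simplified version called Shaper, and show that Shaper improves individual and collective outcomes in a range of challenging environments.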

Scaling Opponent Shaping to High Dimensional Games

We propose a new method, called PiZero, that gives an agent the ability to
plan in an abstract search space of its own creation that is completely
decoupled from the real environment. Unlike prior approaches, this enables the
agent to perform high-level planning at arbitrary timescales and reason in
terms of compound or temporally-extended actions, which can be useful in
environments where large numbers of base-level micro-actions are needed to
perform relevant macro-actions. In addition, our method is more general than
comparable prior methods because it handles settings with continuous action
spaces and partial observability. We evaluate our method on multiple domains,
including navigation tasks and Sokoban. Experimentally, it outperforms
comparable prior methods without assuming access to an environment simulator.

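We propose a new method, PiZero, that lets an agent plan in an abstract search space of its own creation, fully decoupled from the real environment. Unlike prior approaches, the agent can plan at a high level over arbitrary timescales and reason in terms of compound or temporally-extended actions, which helps in environments where many base-level micro-actions are needed to perform a relevant macro-action. The method is also more general than comparable prior methods because it handles continuous action spaces and partial observability. Evaluated on several domains, including navigation tasks and Sokoban, it outperforms comparable prior methods without assuming access to an environment simulator.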

Planning in the imagination: High-level planning on learned abstract search spaces

Multi-agent policy gradient methods have demonstrated success in games and
robotics but are often limited to problems with low-level action spaces.
However, when agents take higher-level, temporally-extended actions (i.e.,
options), it becomes challenging to decide when and how to derive a
centralized control policy and its gradient, and how to sample options for all
agents without interrupting options that are still executing. This is mostly
because agents may choose and terminate their options asynchronously. In this
work, we propose a conditional reasoning approach to address this problem, and
empirically validate its effectiveness on representative option-based
multi-agent cooperative tasks.

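This paper proposes a conditional reasoning approach to the problems of deriving a centralized control policy and its gradient over high-level action spaces in multi-agent cooperative tasks, and validates its effectiveness on representative option-based multi-agent cooperative tasks.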