Stackelberg equilibria arise naturally in a range of popular learning
problems, such as in security games or indirect mechanism design, and have
received increasing attention in the reinforcement learning literature. We
present a general framework for implementing Stackelberg equilibria search as a
multi-agent RL problem, allowing a wide range of algorithmic design choices. We
discuss how previous approaches can be seen as specific instantiations of this
framework. As a key insight, we note that the design space allows for
approaches not previously seen in the literature, for instance by leveraging
multitask and meta-RL techniques for follower convergence. We propose one such
approach using contextual policies, and evaluate it experimentally on both
standard and novel benchmark domains, showing greatly improved sample
efficiency compared to previous approaches. Finally, we explore the effect of
adopting algorithm designs outside the borders of our framework.
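
For readers unfamiliar with the solution concept, the sketch below illustrates a Stackelberg equilibrium in the simplest possible setting: a two-player matrix game in which the leader commits to a pure action and the follower best-responds. It is not the framework or the contextual-policy method from the abstract; the function name `pure_stackelberg`, the payoff matrices, and the tie-breaking rule (ties resolved in the leader's favor) are all assumptions made for illustration.

```python
import numpy as np

def pure_stackelberg(leader_payoff, follower_payoff):
    """Return (leader_action, follower_action, leader_value) under pure commitment.

    Rows index leader actions, columns index follower actions.
    Hypothetical helper for illustration only.
    """
    leader_payoff = np.asarray(leader_payoff, dtype=float)
    follower_payoff = np.asarray(follower_payoff, dtype=float)
    best = None
    for a in range(leader_payoff.shape[0]):
        # The follower observes the committed action a and best-responds to it.
        top = follower_payoff[a].max()
        responses = np.flatnonzero(follower_payoff[a] == top)
        # Assumed convention: break follower ties in the leader's favor.
        b = max(responses, key=lambda j: leader_payoff[a, j])
        value = float(leader_payoff[a, b])
        if best is None or value > best[2]:
            best = (a, int(b), value)
    return best

# Leader action 0 dominates action 1 in simultaneous play,
# yet committing to action 1 steers the follower and pays more.
leader = [[2, 4], [1, 3]]
follower = [[1, 0], [0, 1]]
print(pure_stackelberg(leader, follower))
```

In this game the leader does better by committing to the action that would be dominated in simultaneous play, which is the kind of leader-follower asymmetry that a Stackelberg search has to capture.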

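This work presents a general framework for implementing Stackelberg equilibrium search as a multi-agent reinforcement learning problem and, drawing on multitask and meta-RL techniques, proposes an approach based on contextual policies. Experiments on standard and novel benchmark domains show substantially better sample efficiency than previous approaches. The work also explores the effect of algorithm designs that fall outside the framework.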

Oracles & Followers: Stackelberg Equilibria in Deep Multi-Agent Reinforcement Learning

Can we use reinforcement learning to learn general-purpose policies that can
perform a wide range of different tasks, resulting in flexible and reusable
skills? Contextual policies provide this capability in principle, but the
representation of the context determines the degree of generalization and
expressivity. Categorical contexts preclude generalization to entirely new
tasks. Goal-conditioned policies may enable some generalization, but cannot
capture all tasks that might be desired. In this paper, we propose goal
distributions as a general and broadly applicable task representation suitable
for contextual policies. Goal distributions are general in the sense that they
can represent any state-based reward function when equipped with an appropriate
distribution class, while the particular choice of distribution class allows us
to trade off expressivity and learnability. We develop an off-policy algorithm
called distribution-conditioned reinforcement learning (DisCo RL) to
efficiently learn these policies. We evaluate DisCo RL on a variety of robot
manipulation tasks and find that it significantly outperforms prior methods on
tasks that require generalization to new goal distributions.
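
To make the idea of a goal distribution concrete, the sketch below scores a state by its log-density under a Gaussian goal distribution and uses that score as a reward. Treating the log-density as the reward and choosing a Gaussian distribution class are assumptions made for illustration, and the function name `gaussian_goal_reward` is hypothetical; this is not presented as the DisCo RL implementation.

```python
import numpy as np

def gaussian_goal_reward(state, mean, cov):
    """Log-density of `state` under N(mean, cov), used here as a reward.

    Hypothetical helper: the Gaussian class and log-density reward
    are illustrative assumptions.
    """
    state = np.asarray(state, dtype=float)
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    diff = state - mean
    k = mean.shape[0]
    # slogdet is numerically safer than log(det(cov)).
    _, logdet = np.linalg.slogdet(cov)
    # Squared Mahalanobis distance, computed without forming the inverse.
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (k * np.log(2.0 * np.pi) + logdet + maha)

# A tight variance on dimension 0 and a very loose one on dimension 1
# encodes "reach x = 1.0, and y does not matter".
mean = np.array([1.0, 0.0])
cov = np.diag([0.01, 1e6])
on_target = gaussian_goal_reward([1.0, 5.0], mean, cov)
off_target = gaussian_goal_reward([2.0, 0.0], mean, cov)
print(on_target > off_target)
```

A point goal corresponds to a small variance in every dimension, while inflating the variance along some dimensions marks them as irrelevant to the task. This is one way the choice of distribution class trades off expressivity against learnability.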

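This paper proposes goal distributions as a general task representation for contextual policies, enabling flexible and reusable skills across different tasks, and develops an off-policy algorithm, distribution-conditioned reinforcement learning (DisCo RL), to learn these policies efficiently. Experiments on a variety of robot manipulation tasks show that it significantly outperforms prior methods, particularly on tasks that require generalization to new goal distributions.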