In recent years we have seen fast progress on a number of benchmark problems
in AI, with modern methods achieving near- or super-human performance in Go,
Poker and Dota. One common aspect of all of these challenges is that they are
by design adversarial or, technically speaking, zero-sum. In contrast to these
settings, success in the real world commonly requires humans to collaborate and
communicate with others, in settings that are, at least partially, cooperative.
In the last year, the card game Hanabi has been established as a new benchmark
environment for AI to fill this gap. In particular, Hanabi is interesting to
humans since it is entirely focused on theory of mind, i.e., the ability to
effectively reason over the intentions, beliefs and point of view of other
agents when observing their actions. Learning to be informative when observed
by others is an interesting challenge for Reinforcement Learning (RL):
Fundamentally, RL requires agents to explore in order to discover good
policies. However, when done naively, this randomness will inherently make
their actions less informative to others during training. We present a new deep
multi-agent RL method, the Simplified Action Decoder (SAD), which resolves this
contradiction by exploiting the centralized training phase. During training, SAD
allows other agents to observe not only the (exploratory) action chosen but
also the greedy action of their teammates. By combining
this simple intuition with best practices for multi-agent learning, SAD
establishes a new SOTA for learning methods for 2-5 players on the self-play
part of the Hanabi challenge. Our ablations show the contributions of SAD
compared with the best practice components. All of our code and trained agents
are available at this https URL
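
The core idea described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes epsilon-greedy exploration over a list of Q-values, and the names `select_actions` and `teammate_input` are hypothetical. What teammates receive outside of training is also an assumption here, since the abstract only describes the training phase.

```python
import random


def select_actions(q_values, epsilon, rng):
    """Return (executed_action, greedy_action) under epsilon-greedy exploration.

    The greedy action is always the argmax of the Q-values; the executed
    action is random with probability epsilon and greedy otherwise.
    """
    greedy = max(range(len(q_values)), key=lambda a: q_values[a])
    if rng.random() < epsilon:
        executed = rng.randrange(len(q_values))
    else:
        executed = greedy
    return executed, greedy


def teammate_input(env_obs, executed, greedy, training):
    """Build the input a teammate receives (hypothetical structure).

    During centralized training the teammate sees both the executed
    (possibly exploratory) action and the greedy action. Outside training
    this sketch simply omits the extra input, which is an assumption.
    """
    return {
        "obs": env_obs,
        "executed_action": executed,
        "greedy_action": greedy if training else None,
    }


if __name__ == "__main__":
    rng = random.Random(0)
    executed, greedy = select_actions([0.1, 0.9, 0.3], epsilon=0.5, rng=rng)
    print(teammate_input("card-state", executed, greedy, training=True))
```

Only the executed action affects the environment; the greedy action is side information that stays informative even when exploration randomizes the executed one.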

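This work proposes a deep multi-agent reinforcement learning method, the Simplified Action Decoder (SAD), which exploits the centralized training phase so that exploration does not make agents' actions less informative to their teammates, and it establishes a new SOTA on the self-play part of the Hanabi challenge.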

Simplified Action Decoder for Deep Multi-Agent Reinforcement Learning

Recent years have witnessed significant advances in reinforcement learning
(RL), which has registered great success in solving various sequential
decision-making problems in machine learning. Most of the successful RL
applications, e.g., the games of Go and Poker, robotics, and autonomous
driving, involve more than a single agent and thus
naturally fall into the realm of multi-agent RL (MARL), a domain with a
relatively long history that has recently re-emerged due to advances in
single-agent RL techniques. Though empirically successful, theoretical
foundations for MARL are relatively lacking in the literature. In this chapter,
we provide a selective overview of MARL, with focus on algorithms backed by
theoretical analysis. More specifically, we review the theoretical results of
MARL algorithms mainly within two representative frameworks, Markov/stochastic
games and extensive-form games, in accordance with the types of tasks they
address, i.e., fully cooperative, fully competitive, and a mix of the two. We
also introduce several significant but challenging applications of these
algorithms. Orthogonal to the existing reviews on MARL, we highlight several
new angles and taxonomies of MARL theory, including learning in extensive-form
games, decentralized MARL with networked agents, MARL in the mean-field regime,
(non-)convergence of policy-based methods for learning in games, etc. Some of
the new angles extrapolate from our own research endeavors and interests. Our
overall goal with this chapter is, beyond providing an assessment of the
current state of the field on the mark, to identify fruitful future research
directions on theoretical studies of MARL. We expect this chapter to serve as
a continuing stimulus for researchers interested in working on this exciting
yet challenging topic.

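Taking a selective perspective, this chapter surveys the theory of multi-agent reinforcement learning, focusing on theoretical results for MARL algorithms within the Markov/stochastic game and extensive-form game frameworks. It highlights several new angles and taxonomies of MARL theory, including learning in extensive-form games, decentralized MARL with networked agents, MARL in the mean-field regime, and (non-)convergence of policy-based methods, and it identifies promising directions for future theoretical research.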

Multi-Agent Reinforcement Learning: A Selective Overview of Theories and Algorithms

Reinforcement learning (RL) is a promising data-driven approach for adaptive
traffic signal control (ATSC) in complex urban traffic networks, and deep
neural networks further enhance its learning power. However, centralized RL is
infeasible for large-scale ATSC due to the extremely high dimension of the
joint action space. Multi-agent RL (MARL) overcomes the scalability issue by
distributing the global control to each local RL agent, but it introduces new
challenges: now the environment becomes partially observable from the viewpoint
of each local agent due to limited communication among agents. Most existing
studies in MARL focus on designing efficient communication and coordination
among traditional Q-learning agents. This paper presents, for the first time, a
fully scalable and decentralized MARL algorithm for the state-of-the-art deep
RL agent: advantage actor critic (A2C), within the context of ATSC. In
particular, two methods are proposed to stabilize the learning procedure, by
improving the observability and reducing the learning difficulty of each local
agent. The proposed multi-agent A2C is compared against independent A2C and
independent Q-learning algorithms, in both a large synthetic traffic grid and a
large real-world traffic network of Monaco city, under simulated peak-hour
traffic dynamics. Results demonstrate its optimality, robustness, and sample
efficiency over other state-of-the-art decentralized MARL algorithms.
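
The scalability argument above comes down to simple arithmetic, sketched here. The network size and phase count are hypothetical values chosen for illustration, not figures taken from the paper, and both function names are made up for this example.

```python
def joint_action_count(num_intersections, phases_per_intersection):
    """Number of joint actions a single centralized controller must choose among."""
    return phases_per_intersection ** num_intersections


def decentralized_action_count(num_intersections, phases_per_intersection):
    """Total number of action outputs when each intersection has its own agent."""
    return num_intersections * phases_per_intersection


if __name__ == "__main__":
    # Hypothetical network: 25 intersections, 5 signal phases each.
    n, k = 25, 5
    print("centralized joint actions:", joint_action_count(n, k))
    print("decentralized action outputs:", decentralized_action_count(n, k))
```

The centralized count grows exponentially with the number of intersections, while the decentralized count grows linearly. The price is that each local agent sees only part of the environment.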

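This paper proposes a fully scalable and decentralized multi-agent A2C algorithm for adaptive traffic signal control in urban traffic networks, with two methods that improve the observability and reduce the learning difficulty of each local agent. It is compared against independent A2C and independent Q-learning in a large synthetic traffic grid and a large real-world traffic network of Monaco city under simulated peak-hour traffic dynamics; the results show better optimality, robustness, and sample efficiency than other state-of-the-art decentralized MARL algorithms.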

Multi-Agent Deep Reinforcement Learning for Large-scale Traffic Signal Control

A growing number of learning methods are actually differentiable games whose
players optimise multiple, interdependent objectives in parallel -- from GANs
and intrinsic curiosity to multi-agent RL. Opponent shaping is a powerful
approach to improve learning dynamics in these games, accounting for player
influence on others' updates. Learning with Opponent-Learning Awareness (LOLA)
is a recent algorithm that exploits this response and leads to cooperation in
settings like the Iterated Prisoner's Dilemma. Although experimentally
successful, we show that LOLA agents can exhibit 'arrogant' behaviour directly
at odds with convergence. In fact, remarkably few algorithms have theoretical
guarantees applying across all (n-player, non-convex) games. In this paper we
present Stable Opponent Shaping (SOS), a new method that interpolates between
LOLA and a stable variant named LookAhead. We prove that LookAhead converges
locally to equilibria and avoids strict saddles in all differentiable games.
SOS inherits these essential guarantees, while also shaping the learning of
opponents and consistently either matching or outperforming LOLA
experimentally.
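
The stabilizing effect of LookAhead can be seen on a toy problem. The sketch below is an illustration under stated assumptions, not the paper's code: it assumes the common formulation in which each player takes the gradient of its own loss at the opponent's anticipated parameter (one naive gradient step ahead) while treating that anticipated parameter as fixed. The bilinear game L1 = x*y, L2 = -x*y, the step sizes, and the function names are all chosen for illustration.

```python
import math


def lookahead_step(x, y, lr, alpha):
    """One simultaneous update on the game L1 = x*y (player x), L2 = -x*y (player y).

    With alpha = 0 this reduces to naive simultaneous gradient descent.
    """
    # Anticipated opponent parameters after one naive step of size alpha.
    y_hat = y - alpha * (-x)  # dL2/dy = -x
    x_hat = x - alpha * y     # dL1/dx = y
    # Each player's gradient, evaluated at the anticipated opponent
    # parameter, which is treated as a constant (no shaping term).
    grad_x = y_hat            # dL1/dx at (x, y_hat)
    grad_y = -x_hat           # dL2/dy at (x_hat, y)
    return x - lr * grad_x, y - lr * grad_y


def distance_after(alpha, steps=200, lr=0.1):
    """Distance from the equilibrium (0, 0) after a number of updates."""
    x, y = 1.0, 1.0
    for _ in range(steps):
        x, y = lookahead_step(x, y, lr, alpha)
    return math.hypot(x, y)


if __name__ == "__main__":
    print("naive     (alpha=0):", distance_after(0.0))
    print("lookahead (alpha=1):", distance_after(1.0))
```

In this game, naive updates spiral away from the equilibrium at the origin, while the look-ahead term adds a contraction toward it. LOLA additionally differentiates through the opponent's anticipated step, which is the shaping term that SOS interpolates in and out.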

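This paper proposes Stable Opponent Shaping (SOS), a method that interpolates between Learning with Opponent-Learning Awareness (LOLA) and a stable variant named LookAhead. SOS inherits LookAhead's convergence guarantees in differentiable games while still shaping opponents' learning, and it matches or outperforms LOLA experimentally.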