Effective agent coordination is crucial in cooperative Multi-Agent
Reinforcement Learning (MARL). While agent cooperation can be represented by
graph structures, prevailing graph learning methods in MARL are limited. They
rely solely on one-step observations, neglecting crucial historical
experiences, leading to deficient graphs that foster redundant or detrimental
information exchanges. Additionally, high computational demands for action-pair
calculations in dense graphs impede scalability. To address these challenges,
we propose inferring a Latent Temporal Sparse Coordination Graph (LTS-CG) for
MARL. LTS-CG uses agents' historical observations to compute an
agent-pair probability matrix, from which a sparse graph is sampled and used
for knowledge exchange between agents, thereby capturing both agent
dependencies and relation uncertainty. The computational complexity of this
procedure depends only on the number of agents. This graph learning process
is further augmented by two innovative characteristics: Predict-Future, which
enables agents to foresee upcoming observations, and Infer-Present, which
ensures a thorough grasp of the environmental context from limited data. These features
allow LTS-CG to construct temporal graphs from historical and real-time
information, promoting knowledge exchange during policy learning and effective
collaboration. Graph learning and agent training occur simultaneously in an
end-to-end manner. Our results on the StarCraft II benchmark
demonstrate LTS-CG's superior performance.
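
The graph-sampling step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how the agent-pair probabilities are computed, so the cosine-similarity scoring, the `temperature` parameter, and the function names `pair_probabilities` and `sample_sparse_graph` are all assumptions made for illustration. The sketch only shows the general idea of turning per-agent observation histories into a pairwise probability matrix and drawing a graph from it with independent Bernoulli trials, at a cost that grows with the number of agent pairs.

```python
import math
import random


def pair_probabilities(histories, temperature=1.0):
    """Toy stand-in for a learned encoder (assumption, not the paper's model).

    histories: one flattened observation history (list of floats) per agent.
    Returns an N x N matrix of edge probabilities with a zero diagonal.
    """
    n = len(histories)
    probs = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # no self-edges
            dot = sum(a * b for a, b in zip(histories[i], histories[j]))
            norm_i = math.sqrt(sum(a * a for a in histories[i]))
            norm_j = math.sqrt(sum(b * b for b in histories[j]))
            cos = dot / (norm_i * norm_j) if norm_i > 0 and norm_j > 0 else 0.0
            # Squash the similarity score into (0, 1) with a sigmoid.
            probs[i][j] = 1.0 / (1.0 + math.exp(-cos / temperature))
    return probs


def sample_sparse_graph(probs, rng):
    """Draw one 0/1 adjacency matrix; each edge is an independent Bernoulli trial."""
    n = len(probs)
    return [
        [1 if i != j and rng.random() < probs[i][j] else 0 for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    # Three hypothetical agents with short observation histories.
    histories = [[1.0, 0.0, 1.0], [1.0, 0.0, 1.0], [-1.0, 0.0, -1.0]]
    probs = pair_probabilities(histories)
    graph = sample_sparse_graph(probs, random.Random(0))
    for row in graph:
        print(row)
```

Because edges are sampled rather than thresholded, repeated draws from the same matrix give different graphs, which is one simple way to reflect relation uncertainty.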

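Effective agent coordination is crucial in cooperative multi-agent reinforcement learning. To address existing methods' neglect of historical experience and the poor scalability of dense-graph computation, we propose a multi-agent reinforcement learning method based on a Latent Temporal Sparse Coordination Graph. The method uses agents' historical observations to compute an agent-pair probability matrix and generates a sparse graph from this matrix to promote knowledge exchange between agents, while capturing agent dependencies and relation uncertainty. It also introduces two innovative characteristics, Predict-Future and Infer-Present, which allow it to construct temporal graphs of historical and real-time information from limited data, promoting policy learning and effective collaboration. Experimental results show that the method achieves superior performance on the StarCraft II benchmark.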

Inferring Latent Temporal Sparse Coordination Graph for Multi-Agent Reinforcement Learning

Correlated Equilibrium (CE) is a well-established solution concept that
captures coordination among agents and enjoys good algorithmic properties. In
real-world multi-agent systems, in addition to being in an equilibrium, agents'
policies are often expected to meet requirements with respect to safety and
fairness. Such additional requirements can often be expressed in terms of the
state density, which measures the state-visitation frequencies during the course
of a game. However, existing CE notions or CE-finding approaches cannot
explicitly specify a CE with particular properties concerning state density;
they do so implicitly by either modifying reward functions or using value
functions as the selection criteria. The resulting CE may thus not fully fulfil
the state-density requirements. In this paper, we propose Density-Based
Correlated Equilibria (DBCE), a new notion of CE that explicitly takes state
density as the selection criterion. Concretely, we instantiate DBCE by specifying
different state-density requirements motivated by real-world applications. To
compute DBCE, we put forward the Density Based Correlated Policy Iteration
algorithm for the underlying control problem. We perform experiments on various
games; the results demonstrate the advantage of our CE-finding approach over
existing methods in scenarios with state-density concerns.
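
To make the notion of state density concrete, the following minimal sketch estimates state-visitation frequencies from sampled trajectories and checks one possible requirement against them. It does not reproduce DBCE or the Density Based Correlated Policy Iteration algorithm. The undiscounted empirical estimate, the "cap on total density of unsafe states" requirement, and the function names `state_density` and `satisfies_density_cap` are assumptions chosen for illustration; the abstract does not give the exact definitions.

```python
from collections import Counter


def state_density(trajectories):
    """Empirical state-visitation frequencies (assumed undiscounted).

    trajectories: list of state sequences, each state being hashable.
    Returns a dict mapping each visited state to its share of all visits.
    """
    counts = Counter(state for traj in trajectories for state in traj)
    total = sum(counts.values())
    return {state: count / total for state, count in counts.items()}


def satisfies_density_cap(density, unsafe_states, cap):
    """Hypothetical safety requirement: total density on unsafe states <= cap."""
    return sum(density.get(state, 0.0) for state in unsafe_states) <= cap


if __name__ == "__main__":
    # Two short hypothetical trajectories over states "a", "b", "c".
    trajectories = [["a", "b", "a"], ["a", "c"]]
    density = state_density(trajectories)
    print(density)
    print(satisfies_density_cap(density, unsafe_states={"c"}, cap=0.25))
```

A fairness-style requirement could be expressed in the same way, for example by bounding the gap between the densities of two groups of states, though that choice is again an assumption rather than something the abstract specifies.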

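This paper proposes Density-Based Correlated Equilibria (DBCE), a new notion of CE that better meets the needs of agent coordination and the safety and fairness requirements of real-world applications, and shows experimentally its advantage in scenarios with state-density concerns.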

Learning Density-Based Correlated Equilibria for Markov Games

Modern multi-agent reinforcement learning frameworks rely on centralized
training and reward shaping to perform well. However, centralized training and
dense rewards are not readily available in the real world. Current multi-agent
algorithms struggle to learn in the alternative setup of decentralized training
or sparse rewards. To address these issues, we propose a self-supervised
intrinsic reward ELIGN - expectation alignment - inspired by the
self-organization principle in zoology. Similar to how animals collaborate in a
decentralized manner with those in their vicinity, agents trained with
expectation alignment learn behaviors that match their neighbors' expectations.
This allows the agents to learn collaborative behaviors without any external
reward or centralized training. We demonstrate the efficacy of our approach
across 6 tasks in the multi-agent particle and the complex Google Research
Football environments, comparing ELIGN to sparse and curiosity-based intrinsic
rewards. When the number of agents increases, ELIGN scales well in all
multi-agent tasks except for one where agents have different capabilities. We
show that agent coordination improves through expectation alignment because
agents learn to divide tasks amongst themselves, break coordination symmetries,
and confuse adversaries. These results identify tasks where expectation
alignment is a more useful strategy than curiosity-driven exploration for
multi-agent coordination, enabling agents to perform zero-shot coordination.
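
The idea of rewarding an agent for matching its neighbors' expectations can be sketched as follows. This is an illustrative assumption, not the ELIGN formulation: the abstract does not state how expectations are represented or how misalignment is measured. Here each neighbor is assumed to supply a predicted next observation for the agent, and the hypothetical function `alignment_reward` returns the negative mean Euclidean distance between those predictions and what the agent actually observed.

```python
import math


def alignment_reward(actual_next_obs, neighbor_predictions):
    """Hypothetical intrinsic reward: higher when neighbors predicted the agent well.

    actual_next_obs: the agent's actual next observation (list of floats).
    neighbor_predictions: one predicted next observation per nearby agent.
    Returns 0.0 when there are no neighbors, otherwise a value <= 0.
    """
    if not neighbor_predictions:
        return 0.0  # no neighbors in the vicinity, so no alignment signal
    errors = []
    for prediction in neighbor_predictions:
        squared = sum((p - a) ** 2 for p, a in zip(prediction, actual_next_obs))
        errors.append(math.sqrt(squared))
    return -sum(errors) / len(errors)


if __name__ == "__main__":
    actual = [0.0, 0.0]
    predictions = [[3.0, 4.0], [0.0, 0.0]]  # one poor and one exact prediction
    print(alignment_reward(actual, predictions))
```

Because the signal depends only on agents in the vicinity, it can be computed without a central critic or an external reward, which is the setting the abstract targets.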

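This study addresses the setting of decentralized training or sparse rewards, proposing a self-supervised intrinsic reward, ELIGN (expectation alignment), and examining its effectiveness for multi-agent coordination. Through expectation alignment, agents learn collaborative behaviors and can perform zero-shot coordination, making it more viable than curiosity-based exploration.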