Collaboration is a key challenge in distributed multi-agent reinforcement
learning (MARL) environments. Learning frameworks for these decentralized
systems must weigh the benefits of explicit player coordination against the
communication overhead and computational cost of sharing local observations and
environmental data. Quantum computing has sparked a potential synergy between
quantum entanglement and cooperation in multi-agent environments, which could
enable more efficient distributed collaboration with minimal information
sharing. This relationship is largely unexplored, however, as current
state-of-the-art quantum MARL (QMARL) implementations rely on classical
information sharing rather than entanglement over a quantum channel as a
coordination medium. In contrast, this paper proposes a novel framework dubbed
entangled QMARL (eQMARL). The proposed eQMARL is a distributed
actor-critic framework that facilitates cooperation over a quantum channel and
eliminates local observation sharing via a quantum entangled split critic.
Introducing a quantum critic uniquely spread across the agents allows coupling
of local observation encoders through entangled input qubits over a quantum
channel, which requires no explicit sharing of local observations and reduces
classical communication overhead. Further, agent policies are tuned through
joint observation-value function estimation via joint quantum measurements,
thereby reducing the centralized computational burden. Experimental results
show that eQMARL with $\Psi^{+}$ entanglement converges to a cooperative
strategy up to $17.8\%$ faster and with a higher overall score than
split classical and fully centralized classical and quantum baselines. The
results also show that eQMARL achieves this performance with a constant factor
of $25\times$ fewer centralized parameters than the split classical
baseline.
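
As a minimal illustration of the $\Psi^{+}$ entanglement mentioned above, the following Python sketch prepares the two-qubit Bell state $|\Psi^{+}\rangle = (|01\rangle + |10\rangle)/\sqrt{2}$ with a plain state-vector simulation. The abstract does not specify the eQMARL circuit, so the gate sequence (Hadamard, CNOT, then X on the second qubit) and the helper names `kron`, `apply`, `psi_plus`, and `measurement_probabilities` are assumptions chosen for illustration only.

```python
import math

# Single-qubit gates as nested lists (real entries suffice here).
S = 1.0 / math.sqrt(2.0)
H = [[S, S], [S, -S]]          # Hadamard
X = [[0.0, 1.0], [1.0, 0.0]]   # Pauli-X (bit flip)
I2 = [[1.0, 0.0], [0.0, 1.0]]  # identity

# CNOT with qubit 0 as control and qubit 1 as target,
# in the basis order |00>, |01>, |10>, |11>.
CNOT = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]


def kron(a, b):
    """Kronecker product of two matrices given as nested lists."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]


def apply(gate, state):
    """Apply a gate (matrix) to a state vector."""
    return [sum(g * s for g, s in zip(row, state)) for row in gate]


def psi_plus():
    """Return the amplitudes of |Psi+> = (|01> + |10>) / sqrt(2)."""
    state = [1.0, 0.0, 0.0, 0.0]          # start in |00>
    state = apply(kron(H, I2), state)     # (|00> + |10>) / sqrt(2)
    state = apply(CNOT, state)            # |Phi+> = (|00> + |11>) / sqrt(2)
    state = apply(kron(I2, X), state)     # flip qubit 1 to obtain |Psi+>
    return state


def measurement_probabilities(state):
    """Probability of each computational-basis outcome."""
    return [abs(a) ** 2 for a in state]
```

Calling `measurement_probabilities(psi_plus())` gives roughly equal weight to the outcomes `01` and `10` and none to `00` or `11`: the two qubits are always found anti-correlated, even though neither outcome is determined in advance.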

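This paper proposes eQMARL, a novel framework that facilitates cooperation over a quantum channel and eliminates local observation sharing through a quantum entangled split critic. Experimental results show that, compared with split classical and fully centralized classical and quantum baselines, eQMARL converges to a cooperative strategy faster and with a higher overall score, while requiring 25 times fewer centralized parameters than the split classical baseline.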

eQMARL: Entangled Quantum Multi-Agent Reinforcement Learning for Distributed Cooperation over Quantum Channels

Recent advances in Competitive Self-Play (CSP) have achieved, or even
surpassed, human-level performance in complex game environments such as Dota 2
and StarCraft II using Distributed Multi-Agent Reinforcement Learning (MARL).
One core component of these methods relies on creating a pool of learning
agents -- consisting of the Main Agent, past versions of this agent, and
Exploiter Agents -- where Exploiter Agents learn counter-strategies to the Main
Agents. A key drawback of these approaches is the large computational cost and
wall-clock time required to train the system, making them impractical to
deploy in highly iterative real-life settings such as video game production.
In this paper, we propose the Minimax Exploiter, a game theoretic approach to
exploiting Main Agents that leverages knowledge of its opponents, leading to
significant increases in data efficiency. We validate our approach in a
diverse range of settings, including simple turn-based games, the Arcade Learning
Environment, and For Honor, a modern video game. The Minimax Exploiter
consistently outperforms strong baselines, demonstrating improved stability and
data efficiency, leading to a robust CSP-MARL method that is both flexible and
easy to deploy.

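This paper proposes the Minimax Exploiter, a game-theoretic approach that leverages knowledge of its opponents to significantly improve data efficiency in competitive self-play multi-agent reinforcement learning, and validates that it outperforms strong baselines across a range of environments.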

Minimax Exploiter: A Data Efficient Approach for Competitive Self-Play

Existing distributed cooperative multi-agent reinforcement learning (MARL)
frameworks usually assume undirected coordination graphs and communication
graphs while estimating a global reward via consensus algorithms for policy
evaluation. Such a framework may induce expensive communication costs and
exhibit poor scalability due to the requirement of global consensus. In this work,
we study MARL with directed coordination graphs, and propose a distributed RL
algorithm where the local policy evaluations are based on local value
functions. The local value function of each agent is obtained by local
communication with its neighbors through a directed learning-induced
communication graph, without using any consensus algorithm. A zeroth-order
optimization (ZOO) approach based on parameter perturbation is employed to
achieve gradient estimation. Through comparison with existing ZOO-based RL
algorithms, we show that our proposed distributed RL algorithm guarantees high
scalability. A distributed resource allocation example is shown to illustrate
the effectiveness of our algorithm.
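
To make the idea of zeroth-order gradient estimation by parameter perturbation concrete, here is a minimal Python sketch of a generic two-point estimator with Gaussian perturbations. It is not the paper's algorithm: the abstract does not state which estimator is used, and the function name `zoo_gradient`, the smoothing radius `delta`, and the toy quadratic objective are all assumptions for illustration. In an RL setting, `f` would be a (noisy) estimate of an agent's local value under the perturbed policy parameters.

```python
import random


def zoo_gradient(f, theta, delta=1e-2, num_samples=1, rng=random):
    """Two-point zeroth-order estimate of the gradient of f at theta.

    Uses only evaluations of f, never its derivatives:
        g ~ mean over u of [(f(theta + delta*u) - f(theta - delta*u)) / (2*delta)] * u
    with u drawn from a standard Gaussian.
    """
    dim = len(theta)
    grad = [0.0] * dim
    for _ in range(num_samples):
        # Random perturbation direction in parameter space.
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        plus = [t + delta * ui for t, ui in zip(theta, u)]
        minus = [t - delta * ui for t, ui in zip(theta, u)]
        # Finite-difference slope of f along direction u.
        slope = (f(plus) - f(minus)) / (2.0 * delta)
        for i in range(dim):
            grad[i] += slope * u[i] / num_samples
    return grad


def quadratic(theta):
    """Toy objective (hypothetical); its true gradient is 2 * theta."""
    return sum(t * t for t in theta)


# Example: estimate the gradient at theta = [1, -2]; the true value is [2, -4].
estimate = zoo_gradient(quadratic, [1.0, -2.0], num_samples=20000,
                        rng=random.Random(0))
```

A single sample gives a very noisy estimate, so practical methods average over several perturbations or rely on small step sizes. The variance of such estimators generally grows with the number of perturbed parameters, which is one reason restricting each agent's evaluation to local quantities can help scalability.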

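This paper proposes a distributed reinforcement learning algorithm that uses directed coordination graphs and local value functions, performs gradient estimation via a zeroth-order optimization method, and does not use any consensus algorithm. Compared with existing zeroth-order-optimization-based reinforcement learning algorithms, the proposed algorithm guarantees high scalability.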