Multi-Agent Policy Gradient (MAPG) has made significant progress in recent
years. However, centralized critics in state-of-the-art MAPG methods still face
the centralized-decentralized mismatch (CDM) issue, which means sub-optimal
actions by some agents will affect other agents' policy learning. While using
individual critics for policy updates can avoid this issue, doing so severely
limits cooperation among agents. To address this issue, we propose an agent
topology framework, which decides whether other agents should be considered in
the policy gradient and achieves a compromise between facilitating cooperation
and alleviating the CDM issue. The agent topology allows agents to use coalition
utility as the learning objective instead of the global utility of centralized
critics or the local utility of individual critics. To constitute the agent topology,
various models are studied. We propose Topology-based multi-Agent Policy
gradiEnt (TAPE) for both stochastic and deterministic MAPG methods. We prove
the policy improvement theorem for stochastic TAPE and give a theoretical
explanation for the improved cooperation among agents. Experimental results on
several benchmarks show that the agent topology both facilitates agent
cooperation and alleviates the CDM issue, improving the performance of
TAPE. Finally, multiple ablation studies and a heuristic graph search algorithm
are devised to show the efficacy of the agent topology.
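
As a rough illustration of the compromise described above, the sketch below treats the agent topology as a binary adjacency matrix whose row i marks the agents that agent i considers in its policy gradient. The additive form of the coalition utility, the name `coalition_utility`, and the example numbers are assumptions made for illustration; the abstract does not state how TAPE computes this quantity.

```python
def coalition_utility(topology, local_utilities, agent):
    """Sum the local utilities of the agents connected to `agent`.

    Hypothetical helper: topology[i][j] == 1 means agent i takes
    agent j into account; the additive form is an assumption.
    """
    row = topology[agent]
    return sum(q for connected, q in zip(row, local_utilities) if connected)


local_utilities = [1.0, 2.0, 4.0]  # made-up per-agent utilities

# Identity topology: every agent only considers itself (individual critics).
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
# Fully connected topology: every agent considers all agents (centralized critic).
full = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
# An intermediate topology: each agent considers a subset of the others.
partial = [[1, 1, 0], [0, 1, 0], [1, 0, 1]]

local_view = [coalition_utility(identity, local_utilities, i) for i in range(3)]
global_view = [coalition_utility(full, local_utilities, i) for i in range(3)]
coalition_view = [coalition_utility(partial, local_utilities, i) for i in range(3)]
```

Under these assumptions, the identity matrix recovers the purely local objective and the fully connected matrix recovers the global one, so any topology in between trades cooperation against exposure to other agents' sub-optimal actions.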

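This work proposes an agent topology framework that decides which other agents are considered in the policy gradient, striking a compromise between facilitating cooperation and alleviating the centralized-decentralized mismatch (CDM) issue. The agent topology lets agents use coalition utility as the learning objective, avoiding the problems caused by global or local utility, and experimental results show that it improves the performance of TAPE.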

TAPE: Cooperative Multi-Agent Policy Gradient Based on Agent Topology

TAPE: Leveraging Agent Topology for Cooperative Multi-Agent Policy Gradient

\textit{Relative overgeneralization} (RO) occurs in cooperative multi-agent
learning tasks when agents converge towards a suboptimal joint policy due to
overfitting to suboptimal behavior of other agents. In early work, optimism has
been shown to mitigate the \textit{RO} problem when using tabular Q-learning.
However, with function approximation, optimism can amplify overestimation and
thus fail on complex tasks. On the other hand, recent deep multi-agent policy
gradient (MAPG) methods have succeeded in many complex tasks but may fail with
severe \textit{RO}. We propose a general, yet simple, framework to enable
optimistic updates in MAPG methods and alleviate the RO problem. Specifically,
we employ a \textit{Leaky ReLU} function where a single hyperparameter selects
the degree of optimism to reshape the advantages when updating the policy.
Intuitively, our method remains optimistic toward individual actions with lower
returns which are potentially caused by other agents' sub-optimal behavior
during learning. The optimism prevents the individual agents from quickly
converging to a local optimum. We also provide a formal analysis from an
operator view to understand the proposed advantage transformation. In extensive
evaluations on diverse sets of tasks, including illustrative matrix games,
complex \textit{Multi-agent MuJoCo} and \textit{Overcooked} benchmarks, the
proposed method\footnote{Code can be found at
https://github.com/wenshuaizhao/optimappo.} outperforms strong baselines
on 13 out of 19 tested tasks and matches the performance on the rest.
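
The advantage reshaping described above can be sketched in a few lines. This is a minimal illustration, not the released implementation: the names `leaky_relu_advantage` and `eta`, the range [0, 1], and the convention that a smaller `eta` means more optimism are assumptions, as are the example values.

```python
def leaky_relu_advantage(advantage, eta):
    """Leaky ReLU applied to one advantage estimate.

    Positive advantages pass through unchanged; negative ones are
    scaled by eta. Assumed convention: eta = 1 leaves the advantage
    untouched, eta = 0 ignores negative advantages entirely.
    """
    if not 0.0 <= eta <= 1.0:
        raise ValueError("eta must lie in [0, 1]")
    return advantage if advantage >= 0.0 else eta * advantage


def shape_advantages(advantages, eta):
    """Reshape a batch of advantages before the policy update."""
    return [leaky_relu_advantage(a, eta) for a in advantages]


raw = [1.5, -2.0, 0.0, -0.5]  # made-up advantage estimates
shaped = shape_advantages(raw, eta=0.25)
unchanged = shape_advantages(raw, eta=1.0)
```

With `eta=0.25`, actions with negative advantages are penalized only a quarter as strongly, which is one way to stay optimistic about actions whose low returns may stem from other agents' behavior rather than from the action itself.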

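Using optimistic updates and an activation-function-based optimization, the method addresses the relative overgeneralization problem in multi-agent learning and performs strongly on complex tasks.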

Optimistic Multi-Agent Policy Gradient for Cooperative Tasks

Optimistic Multi-Agent Policy Gradient for Cooperative Tasks

Multi-agent policy gradient (MAPG) methods have recently witnessed vigorous
progress. However, there is a significant performance discrepancy between MAPG
methods and state-of-the-art multi-agent value-based approaches. In this paper,
we investigate causes that hinder the performance of MAPG algorithms and
present a multi-agent decomposed policy gradient method (DOP). This method
introduces the idea of value function decomposition into the multi-agent
actor-critic framework. Based on this idea, DOP supports efficient off-policy
learning and addresses the issue of centralized-decentralized mismatch and
credit assignment in both discrete and continuous action spaces. We formally
show that DOP critics have sufficient representational capability to guarantee
convergence. In addition, empirical evaluations on the StarCraft II
micromanagement benchmark and multi-agent particle environments demonstrate
that DOP significantly outperforms both state-of-the-art value-based and
policy-based multi-agent reinforcement learning algorithms. Demonstrative
videos are available at this https URL
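
To make the idea of value function decomposition concrete, the sketch below combines per-agent values into a joint value with non-negative, state-dependent weights and a bias. The linear form, the name `decomposed_q`, and all numbers are assumptions for illustration; the abstract only says that the critic is decomposed and does not give the formula.

```python
def decomposed_q(weights, local_qs, bias):
    """Joint value as a weighted sum of per-agent values plus a bias.

    Hypothetical linear decomposition: weights[i] is the (non-negative)
    weight of agent i in the current state, local_qs[i] is that agent's
    value for its own action.
    """
    if any(k < 0 for k in weights):
        raise ValueError("weights must be non-negative")
    return sum(k * q for k, q in zip(weights, local_qs)) + bias


weights = [0.5, 2.0]  # made-up state-dependent weights
bias = 0.25           # made-up state-dependent bias

q_before = decomposed_q(weights, [1.0, 3.0], bias)
# Agent 1 switches to an action whose local value is higher by 1.0.
q_after = decomposed_q(weights, [1.0, 4.0], bias)
credit_agent_1 = q_after - q_before
```

Under this assumed form, the change in the joint value caused by one agent's action depends only on that agent's weight and local value, so each agent's contribution can be read off directly, which is the kind of credit assignment a decomposed critic is meant to provide.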

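This paper investigates the performance gap between existing algorithms for multi-agent problems and state-of-the-art value-based methods, and proposes a multi-agent decomposed policy gradient method. The method introduces the idea of value function decomposition and addresses the centralized-decentralized mismatch and credit assignment problems in both discrete and continuous action spaces. Experimental results show that it performs strongly among comparable algorithms.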