MADDPG is an algorithm in multi-agent reinforcement learning (MARL) that
extends the popular single-agent method, DDPG, to multi-agent scenarios.
Importantly, DDPG is an algorithm designed for continuous action spaces, where
the gradient of the state-action value function exists. For this algorithm to
work in discrete action spaces, discrete gradient estimation must be performed.
For MADDPG, the Gumbel-Softmax (GS) estimator is used -- a reparameterisation
which relaxes a discrete distribution into a similar continuous one. This
method, however, is statistically biased, and a recent MARL benchmarking paper
suggests that this bias makes MADDPG perform poorly in grid-world situations,
where the action space is discrete. Fortunately, many alternatives to the GS
exist, boasting a wide range of properties. This paper explores several of
these alternatives and integrates them into MADDPG for discrete grid-world
scenarios. The corresponding impact on various performance metrics is then
measured and analysed. It is found that one of the proposed estimators performs
significantly better than the original GS in several tasks, achieving up to 55%
higher returns, along with faster convergence.
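
To make the relaxation described above concrete, the following is a minimal sketch of Gumbel-Softmax sampling in plain Python. It is not the paper's implementation: the function names, the example logits, and the temperature values are assumptions chosen for illustration, and no gradient computation is shown (in an autodiff framework the one-hot action would typically be used in the forward pass while gradients flow through the relaxed sample).

```python
import math
import random


def sample_gumbel(rng):
    """Draw one standard Gumbel(0, 1) sample by inverse-transform sampling."""
    u = max(rng.random(), 1e-12)  # clamp away from 0 so log(u) stays finite
    return -math.log(-math.log(u))


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax(logits, temperature, rng):
    """Relaxed sample: softmax((logits + Gumbel noise) / temperature).

    The argmax of (logits + noise) is an exact categorical sample
    (the Gumbel-max trick); the softmax replaces that argmax with a
    smooth approximation, which is where the bias comes from.
    """
    noisy = [(l + sample_gumbel(rng)) / temperature for l in logits]
    return softmax(noisy)


def straight_through(relaxed):
    """One-hot vector at the argmax of a relaxed sample (the discrete action)."""
    k = max(range(len(relaxed)), key=relaxed.__getitem__)
    return [1.0 if i == k else 0.0 for i in range(len(relaxed))]


rng = random.Random(0)          # fixed seed, illustrative only
logits = [1.0, 0.5, -0.5]       # hypothetical policy logits for 3 actions
soft_hot = gumbel_softmax(logits, temperature=5.0, rng=rng)   # smoother sample
soft_cold = gumbel_softmax(logits, temperature=0.1, rng=rng)  # closer to one-hot
action = straight_through(soft_cold)
```

Lower temperatures bring the relaxed sample closer to a one-hot vector, at the cost of higher-variance gradients; higher temperatures give smoother samples that are further from the discrete distribution.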

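This paper explores extending the MADDPG algorithm with several alternatives to the Gumbel-Softmax estimator in discrete-action-space scenarios, and measures and analyses various performance metrics. The results show that, in several tasks, one of the proposed estimators achieves significantly higher returns than the original Gumbel-Softmax, along with faster convergence.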

Revisiting the Gumbel-Softmax in MADDPG

Multi-agent RL (MARL) is one of the hard problems in the autonomous driving
literature and remains an obstacle to the release of fully autonomous vehicles.
Several simulators have been developed and iterated on to handle complex
driving scenarios with multiple agents. One such simulator, SMARTS, highlights
the importance of cooperative multi-agent learning. For this problem, we
discuss two approaches, MAPPO and MADDPG, which are based on on-policy and
off-policy RL, respectively. We compare our results with the state-of-the-art
results for this challenge, identify potential areas of improvement, and
discuss the explainability of these approaches in conjunction with waypoints in
the SMARTS environment.

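This paper studies multi-agent RL (MARL) in autonomous driving. It discusses two approaches, MAPPO and MADDPG, which are based on on-policy and off-policy RL respectively, and examines their explainability and potential areas of improvement in conjunction with waypoints in the SMARTS environment.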

On Multi-Agent Deep Deterministic Policy Gradients and their Explainability for SMARTS Environment

We present the PowerGridworld software package to provide users with a
lightweight, modular, and customizable framework for creating
power-systems-focused, multi-agent Gym environments that readily integrate with
existing training frameworks for reinforcement learning (RL). Although many
frameworks exist for training multi-agent RL (MARL) policies, none can rapidly
prototype and develop the environments themselves, especially in the context of
heterogeneous (composite, multi-device) power systems where power flow
solutions are required to define grid-level variables and costs. PowerGridworld
is an open-source software package that helps to fill this gap. To highlight
PowerGridworld's key features, we present two case studies and demonstrate
learning MARL policies using both OpenAI's multi-agent deep deterministic
policy gradient (MADDPG) and RLLib's proximal policy optimization (PPO)
algorithms. In both cases, at least some subset of agents incorporates elements
of the power flow solution at each time step as part of their reward (negative
cost) structures.
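
As a rough illustration of the reward structure described above, the sketch below shows a toy multi-agent environment in which every agent's reward includes a grid-level cost derived from all agents' actions. This is not the PowerGridworld API: the class `ToyGridEnv`, its parameters, the two device names, and the cost formulas are all hypothetical, and a single net-load sum stands in for a real power flow solution. Observations, rewards, and done flags are keyed by agent id, a common multi-agent convention assumed here.

```python
class ToyGridEnv:
    """Hypothetical two-device environment (illustration only)."""

    def __init__(self, base_load_kw=5.0, limit_kw=4.0, horizon=3):
        self.agent_ids = ["battery", "pv"]
        self.base_load_kw = base_load_kw  # uncontrolled feeder load
        self.limit_kw = limit_kw          # feeder limit before penalties apply
        self.horizon = horizon            # number of steps per episode
        self.t = 0

    def reset(self):
        """Start a new episode and return one observation per agent."""
        self.t = 0
        return {aid: [self.base_load_kw] for aid in self.agent_ids}

    def step(self, actions):
        """`actions` maps agent id -> power injection in kW (positive = supply)."""
        # Stand-in for a power flow solve: net feeder load after all injections.
        net_load_kw = self.base_load_kw - sum(actions.values())
        # Grid-level cost shared by all agents: penalise load above the limit.
        grid_cost = max(0.0, net_load_kw - self.limit_kw) ** 2
        rewards = {}
        for aid, p_kw in actions.items():
            device_cost = 0.1 * p_kw ** 2  # agent-specific operating cost
            rewards[aid] = -(device_cost + grid_cost)  # reward = negative cost
        self.t += 1
        done = self.t >= self.horizon
        obs = {aid: [net_load_kw] for aid in self.agent_ids}
        dones = {aid: done for aid in self.agent_ids}
        return obs, rewards, dones, {"net_load_kw": net_load_kw}


env = ToyGridEnv()
obs = env.reset()
obs, rewards, dones, info = env.step({"battery": 1.0, "pv": 0.5})
```

Because the grid-level term depends on the sum of all injections, each agent's reward is coupled to the other agents' actions, which is what makes the problem multi-agent rather than a set of independent single-agent tasks.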

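This work introduces the PowerGridworld software package, a lightweight, modular, and customizable framework for creating power-systems-focused, multi-agent Gym environments that integrate with existing RL training frameworks. Two case studies demonstrate learning MARL policies with PowerGridworld in heterogeneous (composite, multi-device) power systems.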

PowerGridworld: A Framework for Multi-Agent Reinforcement Learning in Power Systems

Policy gradient methods are often applied to reinforcement learning in
continuous multiagent games. These methods perform local search in the
joint-action space, and as we show, they are susceptible to a game-theoretic
pathology known as relative overgeneralization. To resolve this issue, we
propose Multiagent Soft Q-learning, which can be seen as the analogue of
applying Q-learning to continuous controls. We compare our method to MADDPG, a
state-of-the-art approach, and show that our method achieves better
coordination in multiagent cooperative tasks, converging to better local optima
in the joint action space.
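
Relative overgeneralization can be illustrated with a small cooperative matrix game. The abstract concerns continuous games; the discrete example below is only a simplified stand-in, and the payoff values and function names are assumptions chosen for illustration. When one agent averages its payoff over a partner who is still exploring uniformly, a "safe" action looks best even though a different joint action yields the highest shared payoff.

```python
# Shared payoff for (row agent action, column agent action); hypothetical values.
payoff = [
    [11.0, -30.0, 0.0],
    [-30.0, 7.0, 6.0],
    [0.0, 0.0, 5.0],
]


def expected_payoffs(matrix):
    """Row agent's average payoff per action when the partner acts uniformly."""
    n_cols = len(matrix[0])
    return [sum(row) / n_cols for row in matrix]


def best_joint_action(matrix):
    """Joint action (row, col) with the highest shared payoff."""
    cells = ((i, j) for i in range(len(matrix)) for j in range(len(matrix[0])))
    return max(cells, key=lambda ij: matrix[ij[0]][ij[1]])


avg = expected_payoffs(payoff)
# Action the row agent prefers when it only looks at averaged payoffs.
greedy_row = max(range(len(avg)), key=avg.__getitem__)
# Joint action that is actually optimal for the team.
joint = best_joint_action(payoff)
```

Here the averaged payoffs favour the third row action (index 2), whose outcomes are never negative, while the optimal joint action is (0, 0). A learner that follows the averaged signal therefore drifts away from the joint optimum, which is the pathology the abstract refers to.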

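This work studies relative overgeneralization, a pathology that arises when policy gradient methods are applied in continuous multiagent games, and proposes Multiagent Soft Q-learning to address it. Compared with the existing method MADDPG, the proposed method achieves better coordination in multiagent cooperative tasks and converges to better local optima in the joint action space.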