Many complex multi-agent systems, such as robot swarm control and autonomous
vehicle coordination, can be modeled as Multi-Agent Reinforcement Learning
(MARL) tasks. QMIX, a widely used MARL algorithm, serves as a baseline for
benchmark environments such as the StarCraft Multi-Agent Challenge (SMAC) and
Difficulty-Enhanced Predator-Prey (DEPP). Recent variants of QMIX aim to relax
its monotonicity constraint, allowing for improved performance in SMAC. In
this paper, we investigate the code-level optimizations
of these variants and the monotonicity constraint. (1) We find that the
improvements reported for these variants are significantly affected by various
code-level optimizations. (2) Our experimental results show that QMIX with
normalized optimizations outperforms other works in SMAC. (3) Beyond the common
wisdom from these works, the monotonicity constraint can improve sample
efficiency in SMAC and DEPP. We also discuss, with a theoretical analysis, why
monotonicity constraints work well in purely cooperative tasks. We open-source
the code at
https://github.com/hijkzzz/pymarl2.
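
The abstract refers to the monotonicity constraint of QMIX without stating it: the joint value Q_tot must be non-decreasing in every per-agent value Q_i. The following is a minimal sketch of that idea, not the code from the repository above. It assumes a simplified single-layer mixer, whereas QMIX itself uses a mixing network whose weights are produced from the global state by hypernetworks. The function name `monotonic_mix` and all numeric values are hypothetical.

```python
from typing import Sequence


def monotonic_mix(agent_qs: Sequence[float],
                  raw_weights: Sequence[float],
                  bias: float) -> float:
    """Combine per-agent Q-values into a joint value Q_tot.

    Taking abs() of every weight makes each partial derivative
    dQ_tot/dQ_i non-negative, which is the monotonicity constraint.
    """
    if len(agent_qs) != len(raw_weights):
        raise ValueError("need exactly one weight per agent")
    return sum(abs(w) * q for w, q in zip(raw_weights, agent_qs)) + bias


# Hypothetical values: three agents, with raw weights standing in for what a
# state-conditioned hypernetwork might emit (the negative one is made
# non-negative by abs()).
raw_weights = [0.5, -1.0, 2.0]
q_before = monotonic_mix([1.0, 2.0, 3.0], raw_weights, bias=0.25)
q_after = monotonic_mix([1.0, 2.5, 3.0], raw_weights, bias=0.25)  # agent 2 improves

# Raising one agent's Q-value never lowers the joint value.
assert q_after >= q_before
```

Because Q_tot never decreases when an individual Q_i increases, each agent can pick its own greedy action and the resulting joint action still maximizes Q_tot. The variants discussed in the abstract try to relax this restriction.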

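This study investigates the code-level optimizations and the monotonicity constraint of the QMIX algorithm. It shows that code-level optimizations significantly affect the improvements reported for QMIX variants, and finds that in purely cooperative tasks the monotonicity constraint can improve sample efficiency and performance.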

Rethinking the Implementation Tricks and Monotonicity Constraint in Cooperative Multi-Agent Reinforcement Learning

We study the roots of algorithmic progress in deep policy gradient algorithms
through a case study on two popular algorithms: Proximal Policy Optimization
(PPO) and Trust Region Policy Optimization (TRPO). Specifically, we investigate
the consequences of "code-level optimizations": algorithm augmentations found
only in implementations or described as auxiliary details to the core
algorithm. Seemingly of secondary importance, such optimizations turn out to
have a major impact on agent behavior. Our results show that they (a) are
responsible for most of PPO's gain in cumulative reward over TRPO, and (b)
fundamentally change how RL methods function. These insights show the
difficulty and importance of attributing performance gains in deep
reinforcement learning. Code for reproducing our results is available at
this https URL.

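Through a case study of two popular algorithms (PPO and TRPO), we study the roots of algorithmic progress in deep policy gradient algorithms and investigate the consequences of "code-level optimizations": augmentations that appear only in implementations or are described as auxiliary details of the core algorithm. They seem to be of secondary importance, but in fact they strongly affect agent behavior. Our results show that they (a) are responsible for most of PPO's gain in cumulative reward over TRPO, and (b) fundamentally change how RL methods function.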