Centralised training with decentralised execution is an important setting for cooperative deep multi-agent reinforcement learning due to communication constraints during execution and computational tractability in training. In this paper, we analyse value-based methods that are known to have superior performance in complex environments [43]. We specifically focus on QMIX [40], the current state-of-the-art in this domain. We show that the representational constraints on the joint action-values introduced by QMIX and similar methods lead to provably poor exploration and suboptimality. Furthermore, we propose a novel approach called MAVEN that hybridises value and policy-based methods by introducing a latent space for hierarchical control. The value-based agents condition their behaviour on the shared latent variable controlled by a hierarchical policy. This allows MAVEN to achieve committed, temporally extended exploration, which is key to solving complex multi-agent tasks. Our experimental results show that MAVEN achieves significant performance improvements on the challenging SMAC domain [43].

本文提出一种名为MAVEN的新方法，该方法结合了价值和基于策略的方法，引入了层次控制的潜在空间来解决QMIX和类似方法中的行动值表示约束引起的探索不足和次优现象。MAVEN可以实现承诺和延时探索，在具有挑战性的SMAC动态负载均衡问题上取得了显着的性能提升，是一种解决复杂多智能体任务的重要方法。

MAVEN: 多智能体变分探索