Efficient exploration in deep cooperative multi-agent reinforcement learning
(MARL) remains challenging in complex coordination problems. In this
paper, we introduce a novel method, Episodic Multi-agent reinforcement learning
with Curiosity-driven exploration (EMC). We build on an insight from popular
factorized MARL algorithms that the "induced" individual Q-values, i.e., the
individual utility functions used for local execution, are the embeddings of
local action-observation histories, and can capture the interaction between
agents due to reward backpropagation during centralized training. Therefore, we
use prediction errors of individual Q-values as intrinsic rewards for
coordinated exploration, and use episodic memory to exploit informative
explored experience to boost policy training. Because the dynamics of an agent's
individual Q-value function capture the novelty of states and the influence
of other agents, our intrinsic reward can induce coordinated exploration of
new or promising states. We illustrate the advantages of our method with didactic
examples, and show that it significantly outperforms state-of-the-art
MARL baselines on challenging tasks in the StarCraft II micromanagement
benchmark.
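
The curiosity signal described above can be illustrated with a minimal sketch.
It assumes that a separate predictor estimates each agent's individual Q-values
and that the intrinsic reward is the per-agent L2 prediction error averaged
over agents. The function name, the list-of-lists layout, and the averaging
are assumptions made for illustration, not details stated in this abstract.

```python
from typing import Sequence


def intrinsic_reward(
    predicted_q: Sequence[Sequence[float]],
    target_q: Sequence[Sequence[float]],
) -> float:
    """Average over agents of the L2 error between predicted and target
    individual Q-values (hypothetical form of the curiosity reward)."""
    if len(predicted_q) == 0 or len(predicted_q) != len(target_q):
        raise ValueError("need the same non-zero number of agents in both inputs")

    total_error = 0.0
    for pred, targ in zip(predicted_q, target_q):
        if len(pred) != len(targ):
            raise ValueError("each agent needs one value per action in both inputs")
        # L2 norm of the prediction error over this agent's actions.
        squared = sum((p - t) ** 2 for p, t in zip(pred, targ))
        total_error += squared ** 0.5

    # Average across agents so the scale does not grow with team size.
    return total_error / len(predicted_q)
```

Under these assumptions, a joint situation whose individual Q-values are
already predicted well yields a reward near zero, while a poorly predicted one
yields a larger bonus that can be added to the environment reward.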

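This paper proposes an Episodic Multi-agent reinforcement learning method that
uses prediction errors of individual Q-values as intrinsic rewards and uses
episodic memory to improve policy training from experience, enabling effective
exploration and efficient learning in cooperative multi-agent problems. On the
StarCraft II micromanagement benchmark, our method significantly outperforms
existing MARL baselines.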
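
One simple way to picture the episodic memory is as a table that remembers the
best discounted return observed from each visited state, which could later
serve as an additional training target. The sketch below is illustrative only:
the class name, the use of a hashable state key, and the max-return update
rule are assumptions, not details given in this abstract.

```python
from typing import Dict, Hashable, List, Optional, Tuple


class EpisodicMemory:
    """Hypothetical table mapping a state key to the best return seen from it."""

    def __init__(self, gamma: float = 0.99) -> None:
        self.gamma = gamma
        self.best_return: Dict[Hashable, float] = {}

    def update(self, episode: List[Tuple[Hashable, float]]) -> None:
        """Store returns for one finished episode.

        `episode` is a time-ordered list of (state_key, reward) pairs.
        """
        ret = 0.0
        # Walk backwards so each step's discounted return is built incrementally.
        for key, reward in reversed(episode):
            ret = reward + self.gamma * ret
            # Keep only the highest return ever observed from this state.
            if ret > self.best_return.get(key, float("-inf")):
                self.best_return[key] = ret

    def lookup(self, key: Hashable) -> Optional[float]:
        """Return the stored best return, or None if the state is unseen."""
        return self.best_return.get(key)
```

In this sketch a later, worse episode never overwrites a stored value, so the
memory keeps pointing at the most rewarding experience found so far.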