Recent advances in multi-agent reinforcement learning (MARL) have achieved
super-human performance in games like Quake 3 and Dota 2. Unfortunately, these
techniques require orders of magnitude more training rounds than humans and
do not generalize to new agent configurations, even on the same game. In this
work, we propose Collaborative Q-learning (CollaQ) that achieves
state-of-the-art performance in the StarCraft multi-agent challenge and
supports ad hoc team play. We first formulate multi-agent collaboration as a
joint optimization on reward assignment and show that each agent has an
approximately optimal policy that decomposes into two parts: one part that only
relies on the agent's own state, and the other part that is related to states
of nearby agents. Following this novel finding, CollaQ decomposes the
Q-function of each agent into a self term and an interactive term, with a
Multi-Agent Reward Attribution (MARA) loss that regularizes the training.
We evaluate CollaQ on various StarCraft maps and show that it outperforms
existing state-of-the-art techniques (i.e., QMIX, QTRAN, and VDN), improving
the win rate by 40% with the same number of samples. In the more challenging ad
hoc team play setting (i.e., reweight/add/remove units without re-training or
fine-tuning), CollaQ outperforms the previous SoTA by over 30%.

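This paper proposes Collaborative Q-learning (CollaQ), a multi-agent collaborative reinforcement learning algorithm that is trained with a Multi-Agent Reward Attribution (MARA) loss and performs strongly on the StarCraft multi-agent challenge, notably supporting ad hoc team play. The algorithm decomposes each agent's Q-function into a self term and an interactive term and, without re-training or fine-tuning, outperforms the previous SoTA by over 30%.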
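
The decomposition of each agent's Q-function into a self term and an interactive term can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the text above: the linear "networks", the dimensions, the function names, the weight `mara_weight`, and in particular the form of the regularizer, which is assumed to penalize the interactive term when the states of other agents are masked out. The abstract only states that a MARA loss regularizes training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: own-state features, nearby-agent features, actions.
OWN_DIM, OTHERS_DIM, N_ACTIONS = 4, 6, 3

# Hypothetical linear stand-ins for the two Q-networks (no bias terms).
W_self = rng.normal(size=(OWN_DIM, N_ACTIONS))
W_inter = rng.normal(size=(OWN_DIM + OTHERS_DIM, N_ACTIONS))


def q_self(own_state):
    """Self term: depends only on the agent's own state."""
    return own_state @ W_self


def q_interactive(own_state, others_state):
    """Interactive term: depends on the agent's state and nearby agents' states."""
    joint = np.concatenate([own_state, others_state], axis=-1)
    return joint @ W_inter


def q_total(own_state, others_state):
    """Per-agent Q-values as the sum of the self and interactive terms."""
    return q_self(own_state) + q_interactive(own_state, others_state)


def mara_penalty(own_state):
    """Assumed regularizer: squared interactive term with other agents zeroed out."""
    masked_others = np.zeros(own_state.shape[:-1] + (OTHERS_DIM,))
    return float(np.mean(q_interactive(own_state, masked_others) ** 2))


def total_loss(q_taken, reward, q_next_max, own_state, gamma=0.99, mara_weight=1.0):
    """Standard one-step TD loss plus the assumed regularizer."""
    target = reward + gamma * q_next_max
    td = float(np.mean((q_taken - target) ** 2))
    return td + mara_weight * mara_penalty(own_state)
```

In this sketch the regularizer pushes the interactive term toward zero whenever no other agent is visible, so that the self term alone accounts for the value of acting in isolation. A real implementation would replace the linear maps with trained networks and combine per-agent values through a mixing scheme, which is outside the scope of this illustration.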