Communication is crucial for solving cooperative Multi-Agent Reinforcement
Learning tasks in Partially Observable Markov Decision Processes. Existing
work often relies on black-box methods to encode local information or features into
messages shared with other agents. However, such black-box approaches are
unable to provide any quantitative guarantees on the expected return and often
lead to the generation of continuous messages with high communication overhead
and poor interpretability. In this paper, we establish an upper bound on the
return gap between an ideal policy with full observability and an optimal
partially-observable policy with discrete communication. This result enables us
to recast multi-agent communication into a novel online clustering problem over
the local observations at each agent, with messages as cluster labels and the
upper bound on the return gap as clustering loss. By minimizing the upper
bound, we propose a surprisingly simple design of message generation functions
in multi-agent communication and integrate it with reinforcement learning using
a Regularized Information Maximization loss function. Evaluations show that the
proposed discrete communication significantly outperforms state-of-the-art
multi-agent communication baselines and can achieve nearly optimal returns with
few-bit messages that are naturally interpretable.
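
To make the clustering view concrete, the following is a minimal sketch of a generic Regularized Information Maximization objective in which discrete messages are read off as cluster labels. It is an illustration under stated assumptions, not the paper's implementation: the names `rim_loss` and `messages`, the L2 penalty, and the coefficient `reg_coef` are all hypothetical, and the paper's actual loss (including how the return-gap bound enters) may differ.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def entropy(p, axis=-1, eps=1e-12):
    """Shannon entropy in nats; eps guards against log(0)."""
    return -np.sum(p * np.log(p + eps), axis=axis)

def rim_loss(logits, weights, reg_coef=1e-3):
    """Generic RIM objective (assumed form, not taken from the paper).

    logits:  (batch, n_messages) scores from a hypothetical message encoder.
    weights: encoder parameters, penalized here with a plain L2 term.
    """
    probs = softmax(logits)
    marginal = probs.mean(axis=0)
    # Mutual information estimate: balanced use of labels (high marginal
    # entropy) combined with confident assignments (low per-sample entropy).
    mutual_info = entropy(marginal) - entropy(probs, axis=1).mean()
    return -mutual_info + reg_coef * np.sum(weights ** 2)

def messages(logits):
    """Discrete message per observation = index of the most likely cluster."""
    return np.argmax(logits, axis=1)
```

With four clusters, each message fits in two bits. In this sketch the loss is lowest when observations are assigned confidently and the four labels are used evenly, which is the behavior the information-maximization term is meant to encourage.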

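This paper studies communication in multi-agent reinforcement learning under partially observable Markov decision processes. It recasts multi-agent communication as an online clustering problem that yields discrete messages, and integrates it with reinforcement learning using a Regularized Information Maximization loss function. Experiments show that the method achieves nearly optimal returns with naturally interpretable messages of only a few bits.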