Communication is crucial for solving cooperative Multi-Agent Reinforcement
Learning tasks in Partially-Observable Markov Decision Processes. Existing
works often rely on black-box methods to encode local information/features into
messages shared with other agents. However, such black-box approaches are
unable to provide any quantitative guarantees on the expected return and often
lead to the generation of continuous messages with high communication overhead
and poor interpretability. In this paper, we establish an upper bound on the
return gap between an ideal policy with full observability and an optimal
partially-observable policy with discrete communication. This result enables us
to recast multi-agent communication into a novel online clustering problem over
the local observations at each agent, with messages as cluster labels and the
upper bound on the return gap as clustering loss. By minimizing the upper
bound, we propose a surprisingly simple design of message generation functions
in multi-agent communication and integrate it with reinforcement learning using
a Regularized Information Maximization loss function. Evaluations show that the
proposed discrete communication significantly outperforms state-of-the-art
multi-agent communication baselines and can achieve near-optimal returns with
few-bit messages that are naturally interpretable.
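
To make the clustering view concrete, the following is a minimal sketch, not the
authors' implementation. It assumes each agent maps a local observation to
logits over K clusters (the hypothetical `batch_logits` input), sends the index
of the most likely cluster as its message, and scores a batch with the
mutual-information term of Regularized Information Maximization minus an L2
penalty. The function names, the penalty form, and the omission of the
return-gap bound are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def rim_objective(batch_logits, weights=None, reg_coeff=0.0):
    """Assumed RIM-style objective (to be maximized).

    Mutual information between observations and cluster labels, estimated
    as H(mean assignment) - mean H(assignment), minus an L2 penalty on the
    hypothetical encoder weights.
    """
    probs = [softmax(row) for row in batch_logits]
    n = len(probs)
    k = len(probs[0])
    marginal = [sum(p[j] for p in probs) / n for j in range(k)]
    cond_entropy = sum(entropy(p) for p in probs) / n
    mutual_info = entropy(marginal) - cond_entropy
    penalty = reg_coeff * sum(w * w for w in (weights or []))
    return mutual_info - penalty


def message(logits):
    """Discrete message: the index of the most likely cluster."""
    return max(range(len(logits)), key=lambda j: logits[j])


def bits_per_message(num_clusters):
    """Bits needed to send one of num_clusters labels."""
    return max(1, math.ceil(math.log2(num_clusters)))
```

In this sketch, a batch whose observations are assigned confidently and evenly
across clusters scores higher than a batch with uniform assignments, and a
vocabulary of 4 clusters needs 2 bits per message.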

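This paper studies communication in multi-agent reinforcement learning under
partially observable Markov decision processes. It recasts multi-agent
communication as an online clustering problem that yields discrete messages,
and combines it with reinforcement learning using a Regularized Information
Maximization loss function. Experiments show that the method achieves
near-optimal returns with naturally interpretable messages of only a few bits.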

Minimizing Return Gaps with Discrete Communications in Decentralized POMDP

The field of emergent communication aims to understand the characteristics of
communication as it emerges from artificial agents solving tasks that require
information exchange. Communication with discrete messages is considered a
desirable characteristic, for both scientific and applied reasons. However,
training a multi-agent system with discrete communication is not
straightforward, requiring either reinforcement learning algorithms or relaxing
the discreteness requirement via a continuous approximation such as the
Gumbel-softmax. Both these solutions result in poor performance compared to
fully continuous communication. In this work, we propose an alternative
approach to achieve discrete communication -- quantization of communicated
messages. Using message quantization allows us to train the model end-to-end,
achieving superior performance in multiple setups. Moreover, quantization is a
natural framework that runs the gamut from continuous to discrete
communication. Thus, it sets the ground for a broader view of multi-agent
communication in the deep learning era.
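
As an illustration of how quantization spans continuous to discrete messages,
here is a minimal sketch of uniform quantization, assuming messages are clipped
to a fixed range [low, high] and rounded to evenly spaced levels. The function
name, the range, and the uniform grid are assumptions; the abstract does not
specify the quantizer or how gradients pass through it, so only the forward
pass is shown.

```python
def quantize(message, num_levels, low=0.0, high=1.0):
    """Round each coordinate of a message to one of num_levels evenly
    spaced values in [low, high] (hypothetical forward pass only)."""
    if num_levels < 2:
        raise ValueError("num_levels must be at least 2")
    if high <= low:
        raise ValueError("high must be greater than low")
    step = (high - low) / (num_levels - 1)
    quantized = []
    for x in message:
        clipped = min(max(x, low), high)       # keep the value in range
        index = round((clipped - low) / step)  # nearest level index
        quantized.append(low + index * step)
    return quantized
```

With `num_levels=2` each coordinate carries one bit; raising `num_levels` moves
the quantized message toward its continuous value.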

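This work proposes message quantization as a way to achieve discrete
communication. Across multiple setups, it outperforms approaches based on
reinforcement learning algorithms or on continuous approximations such as the
Gumbel-softmax, and it offers a broader view of multi-agent communication in
the deep learning era.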