Modern multi-agent reinforcement learning (RL) algorithms hold great potential for solving a variety of real-world problems. However, they do not fully exploit cross-agent knowledge to reduce sample complexity and improve performance. Although transfer RL supports knowledge sharing, it is hyperparameter sensitive and complex. To solve this problem, we propose a novel multi-agent policy reciprocity (PR) framework, where each agent can fully exploit cross-agent policies even in mismatched states. We then define an adjacency space for mismatched states and design a plug-and-play module for value iteration, which enables agents to infer more precise returns. To improve the scalability of PR, deep PR is proposed for continuous control tasks. Moreover, theoretical analysis shows that agents can asymptotically reach consensus through individual perceived rewards and converge to an optimal value function, which implies the stability and effectiveness of PR, respectively. Experimental results on discrete and continuous environments demonstrate that PR outperforms various existing RL and transfer RL methods.
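The abstract does not state the update rule, so the following is only an illustrative sketch of the general idea: a tabular agent blends its own value-iteration target with estimates that peer agents hold at nearby states. It is not the authors' PR module. The Manhattan distance, the `radius` that defines adjacency, the mixing weight `beta`, and all function names are assumptions introduced for illustration.

```python
from typing import Dict, List, Optional, Tuple

State = Tuple[int, int]
QTable = Dict[Tuple[State, int], float]


def manhattan(a: State, b: State) -> int:
    # Assumed distance metric; the abstract does not specify one.
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def peer_value(state: State, action: int, peer_qs: List[QTable],
               radius: int = 1) -> Optional[float]:
    """Average peers' Q-values for `action` over states within `radius`.

    Returns None when no peer holds an estimate near `state`.
    """
    values = [
        value
        for q in peer_qs
        for (s, a), value in q.items()
        if a == action and manhattan(state, s) <= radius
    ]
    return sum(values) / len(values) if values else None


def reciprocal_q_update(q: QTable, peer_qs: List[QTable], state: State,
                        action: int, reward: float, next_state: State,
                        n_actions: int, alpha: float = 0.5,
                        gamma: float = 0.9, beta: float = 0.3) -> float:
    """One Q-learning step whose target is mixed with peers' estimates."""
    next_best = max(q.get((next_state, a), 0.0) for a in range(n_actions))
    target = reward + gamma * next_best
    borrowed = peer_value(state, action, peer_qs)
    if borrowed is not None:
        # Hypothetical blend: weight `beta` on the borrowed estimate.
        target = (1.0 - beta) * target + beta * borrowed
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (target - old)
    return q[(state, action)]
```

In this sketch, a peer that has visited `(0, 1)` can inform an agent sitting at `(0, 0)` even though the two states do not match exactly; with no peer estimate in range, the update reduces to ordinary Q-learning.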

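This paper proposes a novel multi-agent policy reciprocity (PR) framework in which each agent can fully exploit cross-agent policies even in mismatched states. It defines an adjacency space for mismatched states and designs a plug-and-play module for value iteration, with the aim of improving the scalability and stability of PR. Experiments show that PR outperforms various existing RL and transfer RL methods in both discrete and continuous environments.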

Multi-Agent Policy Reciprocity with Theoretical Guarantee