Standard cooperative multi-agent reinforcement learning (MARL) methods aim to find the optimal cooperative team policy for completing a task. However, there may exist multiple distinct ways of cooperating, and domain experts often need access to these alternatives. Identifying a set of significantly different policies can therefore reduce the task complexity for them. Unfortunately, there is a general lack of effective policy diversity approaches designed specifically for the multi-agent domain. In this work, we propose a method called Moment-Matching Policy Diversity to address this problem. The method generates team policies that differ to varying degrees by formalizing the difference between team policies as the difference in the actions of selected agents under different policies. Theoretically, we show that our method is a simple way to implement a constrained optimization problem that regularizes the difference between two trajectory distributions using the maximum mean discrepancy. The effectiveness of our approach is demonstrated on a challenging team-based shooter.
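To make the maximum mean discrepancy (MMD) mentioned above concrete, the following is a minimal sketch of a sample-based MMD estimate between the actions of one selected agent under two policies. It is not the authors' implementation: the Gaussian kernel, the `bandwidth` parameter, the function names `rbf_kernel` and `mmd_squared`, and the representation of actions as small numeric vectors are all assumptions made for illustration.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def rbf_kernel(x: Vector, y: Vector, bandwidth: float = 1.0) -> float:
    """Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2)).

    The kernel choice is an assumption; the abstract does not name one.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs: Sequence[Vector], ys: Sequence[Vector],
                bandwidth: float = 1.0) -> float:
    """Biased (V-statistic) estimate of the squared MMD between two samples."""

    def mean_kernel(a: Sequence[Vector], b: Sequence[Vector]) -> float:
        # Average kernel value over all pairs drawn from a and b.
        total = sum(rbf_kernel(u, v, bandwidth) for u in a for v in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Hypothetical action samples (e.g. action probabilities) of one agent
# under two different team policies.
policy_a_actions = [(1.0, 0.0), (0.9, 0.1), (1.0, 0.0)]
policy_b_actions = [(0.0, 1.0), (0.1, 0.9), (0.0, 1.0)]

print(mmd_squared(policy_a_actions, policy_a_actions))  # identical samples
print(mmd_squared(policy_a_actions, policy_b_actions))  # clearly different samples
```

In a diversity-seeking setup, a quantity of this kind could serve as the measured difference between two policies; how the paper turns it into a constraint or regularizer is not specified in the abstract and is not reproduced here.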

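Standard multi-agent reinforcement learning methods aim to find the optimal cooperative team policy for completing a task. However, there may be multiple options among the different ways of cooperating, which often greatly increases the task complexity for domain experts. We therefore propose a method called Moment-Matching Policy Diversity, which generates different team policies by formalizing the difference in the actions of the selected agents under different policies. Theoretically, we show that the method is a simple way to implement a constrained optimization problem using the maximum mean discrepancy. The effectiveness of our method is validated on a challenging team-based shooter game.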

Policy Diversity for Cooperative Agents