Several multiagent reinforcement learning (MARL) algorithms have been proposed to optimize agents decisions. Due to the complexity of the problem, the majority of the previously developed MARL algorithms assumed agents either had some knowledge of the underlying game (such as Nash equilibria) and/or observed other agents actions and the rewards they received. We introduce a new MARL algorithm called the Weighted Policy Learner (WPL), which allows agents to reach a Nash Equilibrium (NE) in benchmark 2-player-2-action games with minimum knowledge. Using WPL, the only feedback an agent needs is its own local reward (the agent does not observe other agents actions or rewards). Furthermore, WPL does not assume that agents know the underlying game or the corresponding Nash Equilibrium a priori. We experimentally show that our algorithm converges in benchmark two-player-two-action games. We also show that our algorithm converges in the challenging Shapleys game where previous MARL algorithms failed to converge without knowing the underlying game or the NE. Furthermore, we show that WPL outperforms the state-of-the-art algorithms in a more realistic setting of 100 agents interacting and learning concurrently. An important aspect of understanding the behavior of a MARL algorithm is analyzing the dynamics of the algorithm: how the policies of multiple learning agents evolve over time as agents interact with one another. Such an analysis not only verifies whether agents using a given MARL algorithm will eventually converge, but also reveals the behavior of the MARL algorithm prior to convergence. We analyze our algorithm in two-player-two-action games and show that symbolically proving WPLs convergence is difficult, because of the non-linear nature of WPLs dynamics, unlike previous MARL algorithms that had either linear or piece-wise-linear dynamics. Instead, we numerically solve WPLs dynamics differential equations and compare the solution to the dynamics of previous MARL algorithms.

使用加权策略学习器（Weighted Policy Learner）算法，基于本地奖励的反馈，实现了多智能体强化学习（MARL）算法在二人二选手博弈中寻找Nash Equilibrium的能力。与之前的算法相比，WPL不需要观察其他智能体动作和奖励，也不需要预先了解博弈本质和NE解，收敛表现优于现有的算法，并且在100个智能体交互中并行收敛。通过对WPL的动力学分析，可以更好地理解该算法的行为，分析WPL的收敛性比较困难，需要数值模拟求解动力学微分方程来验证其收敛性。

具有非线性动力学的多智能体强化学习算法