We show by counterexample that policy-gradient algorithms have no guarantees
of even local convergence to Nash equilibria in multi-agent settings with
continuous state and action spaces. To do so, we analyze gradient-play in N-player
general-sum linear quadratic (LQ) games, a classic game setting that has recently
emerged as a benchmark in the field of multi-agent learning. In such games the
state and action spaces are continuous and global Nash equilibria can be found
by solving coupled Riccati equations. Further, gradient-play in LQ games is
equivalent to multi-agent policy-gradient. We first show that these games are
surprisingly not convex games. Despite this, we are still able to show that the
only critical points of the gradient dynamics are global Nash equilibria. We
then give sufficient conditions under which policy-gradient will avoid the Nash
equilibria, and generate a large number of general-sum linear quadratic games
that satisfy these conditions. In such games we empirically observe the players
converging to limit cycles for which the time average does not coincide with a
Nash equilibrium. The existence of such games indicates that one of the most
popular approaches to solving problems in the classic reinforcement learning
setting has no guarantee of even local convergence in multi-agent settings.
Further, the ease with which we can generate these
counterexamples suggests that such situations are not mere edge cases and are
in fact quite common.
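
As a rough illustration of what gradient-play means here, the following sketch uses a scalar two-player LQ game in which each player applies a linear feedback gain and updates it with the gradient of its own cost only. The scalar setup, the function names, and all parameter values are assumptions chosen for readability; they are not taken from the paper, whose counterexamples are higher-dimensional games. The sketch shows the update rule only and makes no claim about convergence or cycling.

```python
import numpy as np

def _as_arrays(*values):
    # Helper (hypothetical name): convert inputs to float arrays.
    return tuple(np.asarray(v, dtype=float) for v in values)

def costs(k, a, b, q, r):
    """Costs J_i from x0 = 1 when player i plays u_i = -k_i * x.

    Assumed scalar dynamics: x_{t+1} = a*x_t + sum_i b_i*u_i,
    with stage cost q_i*x^2 + r_i*u_i^2 for player i.
    """
    k, b, q, r = _as_arrays(k, b, q, r)
    a_cl = a - b @ k  # closed loop: x_{t+1} = a_cl * x_t
    if abs(a_cl) >= 1.0:
        raise ValueError("unstable closed loop: costs are infinite")
    # x_t = a_cl**t, so the infinite sum is a geometric series.
    return (q + r * k**2) / (1.0 - a_cl**2)

def individual_gradients(k, a, b, q, r):
    """Vector whose i-th entry is dJ_i/dk_i (own-cost gradient only)."""
    k, b, q, r = _as_arrays(k, b, q, r)
    a_cl = a - b @ k
    denom = 1.0 - a_cl**2
    return 2.0 * r * k / denom - 2.0 * b * a_cl * (q + r * k**2) / denom**2

def gradient_play(k0, a, b, q, r, step=0.01, iters=500):
    """Simultaneous gradient steps: every player descends its own cost."""
    k = np.array(k0, dtype=float)
    history = [k.copy()]
    for _ in range(iters):
        k = k - step * individual_gradients(k, a, b, q, r)
        history.append(k.copy())
    return np.array(history)

if __name__ == "__main__":
    # Illustrative parameters (assumed, not from the paper).
    trajectory = gradient_play([0.4, 0.2], 0.9, [0.5, 0.3], [1.0, 2.0], [1.0, 1.0])
    print("final gains:", trajectory[-1])
```

Stacking each player's own gradient into one vector, rather than taking the gradient of a single shared objective, is what separates gradient-play from ordinary gradient descent. The resulting vector field need not be a gradient field, which is why limit cycles are possible in principle.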

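This paper studies policy-gradient algorithms in multi-agent Markov decision processes. By analyzing gradient-play in linear quadratic games, it shows that these algorithms have no guarantee of even local convergence to Nash equilibria, and experiments indicate that such cases are not rare.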

Policy-Gradient Algorithms Have No Guarantees of Convergence in Linear Quadratic Games

We study the global convergence of policy optimization for finding the Nash
equilibria (NE) in zero-sum linear quadratic (LQ) games. To this end, we first
investigate the landscape of LQ games, viewing it as a nonconvex-nonconcave
saddle-point problem in the policy space. Specifically, we show that despite
its nonconvexity and nonconcavity, zero-sum LQ games have the property that the
stationary point of the objective function with respect to the linear feedback
control policies constitutes the NE of the game. Building upon this, we develop
three projected nested-gradient methods that are guaranteed to converge to the
NE of the game. Moreover, we show that all of these algorithms enjoy both
globally sublinear and locally linear convergence rates. Simulation results are
also provided to illustrate the satisfactory convergence properties of the
algorithms. To the best of our knowledge, this work appears to be the first
to investigate the optimization landscape of LQ games and to provably show the
convergence of policy optimization methods to the Nash equilibria. Our work
serves as an initial step toward understanding the theoretical aspects of
policy-based reinforcement learning algorithms for zero-sum Markov games in
general.
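
The landscape claim can be checked numerically in the easy direction. The sketch below assumes a scalar zero-sum LQ game, computes the Nash gains from the standard scalar Riccati recursion, and evaluates the partial derivatives of the cost at those gains. The scalar model, the function names, and the parameter values are illustrative assumptions, not the paper's setup. The sketch only confirms that the Riccati solution is a stationary point of the cost; it does not demonstrate the converse, which is the paper's result, and it does not implement the projected nested-gradient methods.

```python
import numpy as np

def zero_sum_cost(k, l, a, b, c, q, ru, rw):
    """J(k, l) from x0 = 1 with u = -k*x (minimizer) and w = -l*x (maximizer).

    Assumed dynamics: x_{t+1} = a*x + b*u + c*w,
    stage cost q*x^2 + ru*u^2 - rw*w^2.
    """
    a_cl = a - b * k - c * l
    if abs(a_cl) >= 1.0:
        raise ValueError("unstable closed loop: cost is not finite")
    return (q + ru * k**2 - rw * l**2) / (1.0 - a_cl**2)

def zero_sum_gradient(k, l, a, b, c, q, ru, rw):
    """Partial derivatives (dJ/dk, dJ/dl) of the cost above."""
    a_cl = a - b * k - c * l
    denom = 1.0 - a_cl**2
    p = (q + ru * k**2 - rw * l**2) / denom  # cost-to-go coefficient
    dk = 2.0 * (ru * k - b * a_cl * p) / denom
    dl = 2.0 * (-rw * l - c * a_cl * p) / denom
    return dk, dl

def riccati_nash_gains(a, b, c, q, ru, rw, iters=200):
    """Nash gains from the scalar recursion p <- q + a^2 p / (1 + g p).

    Assumes rw is large enough (rw > c^2 * p) for the saddle point to exist.
    """
    g = b**2 / ru - c**2 / rw
    p = 0.0
    for _ in range(iters):
        p = q + a**2 * p / (1.0 + g * p)
    d = 1.0 + g * p
    k = p * b * a / (ru * d)   # minimizer's gain
    l = -p * c * a / (rw * d)  # maximizer's gain
    return k, l, p

if __name__ == "__main__":
    params = dict(a=0.9, b=1.0, c=0.2, q=1.0, ru=1.0, rw=5.0)  # assumed values
    k_star, l_star, p_star = riccati_nash_gains(**params)
    print("gradient at Nash gains:", zero_sum_gradient(k_star, l_star, **params))
```

Perturbing either gain away from the Riccati solution (while keeping the closed loop stable) should raise the cost along `k` and lower it along `l`, which is the saddle-point structure the abstract refers to.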

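This work studies the global convergence of policy optimization for finding Nash equilibria in zero-sum linear quadratic games. It develops three projected nested-gradient methods with provable convergence, supported by simulation results, as a step toward the theory of policy-based reinforcement learning for zero-sum Markov games.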