Despite the success of single-agent reinforcement learning, multi-agent
reinforcement learning (MARL) remains challenging due to complex interactions
between agents. Motivated by decentralized applications such as sensor
networks, swarm robotics, and power grids, we study policy evaluation in MARL,
where agents with jointly observed state-action pairs and private local rewards
collaborate to learn the value of a given policy. In this paper, we propose a
double averaging scheme, where each agent iteratively performs averaging over
both space and time to incorporate neighboring gradient information and local
reward information, respectively. We prove that the proposed algorithm
converges to the optimal solution at a global geometric rate. The algorithm
is built upon a primal-dual reformulation of the mean squared
projected Bellman error minimization problem, which gives rise to a
decentralized convex-concave saddle-point problem. To the best of our
knowledge, the proposed double averaging primal-dual optimization algorithm is
the first to achieve fast finite-time convergence on decentralized
convex-concave saddle-point problems.
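
To illustrate the primal-dual reformulation mentioned above, the following is a minimal sketch rather than the paper's algorithm. It assumes the error takes the common quadratic form 0.5 * (A @ theta - b)^T C^{-1} (A @ theta - b) with a symmetric positive-definite matrix C. The names `A`, `b`, `C`, `theta`, and `w` are illustrative and not taken from the text, and regularization and the split of `b` into per-agent local rewards are omitted. Under these assumptions, maximizing a concave quadratic over a dual variable `w` recovers the original objective, which is what turns the minimization into a convex-concave saddle-point problem.

```python
import numpy as np

def mspbe(theta, A, b, C):
    """Assumed primal objective: 0.5 * r^T C^{-1} r with r = A @ theta - b."""
    r = A @ theta - b
    return 0.5 * r @ np.linalg.solve(C, r)

def saddle_objective(theta, w, A, b, C):
    """Saddle form: linear in theta (convex), strongly concave in w."""
    r = A @ theta - b
    return w @ r - 0.5 * w @ C @ w

def best_dual(theta, A, b, C):
    """Maximizer of the saddle form over w for fixed theta: w = C^{-1} r."""
    return np.linalg.solve(C, A @ theta - b)
```

For any fixed `theta`, evaluating `saddle_objective` at `best_dual(theta, A, b, C)` should match `mspbe(theta, A, b, C)` up to floating-point error, and any other `w` should give a value that is no larger.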

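This paper proposes a double averaging scheme in which each agent iteratively averages over space and time to incorporate neighboring gradient information and local reward information. The scheme addresses policy evaluation in multi-agent reinforcement learning and achieves fast convergence on decentralized convex-concave saddle-point problems.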