Decentralized online planning can be an attractive paradigm for cooperative
multi-agent systems, due to improved scalability and robustness. A key
difficulty of such an approach lies in making accurate predictions about the
decisions of other agents. In this paper, we present a trainable online
decentralized planning algorithm based on decentralized Monte Carlo Tree
Search, combined with models of teammates learned from previous episodic runs.
By only allowing one agent to adapt its models at a time, under the assumption
of ideal policy approximation, successive iterations of our method are
guaranteed to improve joint policies, and eventually lead to convergence to a
Nash equilibrium. We test the efficiency of the algorithm by performing
experiments in several scenarios of the spatial task allocation environment
introduced in [Claes et al., 2015]. We show that deep learning and
convolutional neural networks can be employed to produce accurate policy
approximators which exploit the spatial features of the problem, and that the
proposed algorithm improves over the baseline planning performance for
particularly challenging domain configurations.
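
The one-agent-at-a-time adaptation described above can be illustrated with a much simpler stand-in: alternating best responses in a two-agent common-payoff matrix game. This is only a sketch under stated assumptions, not the paper's algorithm: there is no tree search and no learned teammate model, and the payoff matrix and the names `alternating_best_response` and `is_nash` are hypothetical. It shows why freezing all agents but one makes each update a (weak) improvement of the joint payoff, and why the fixed point is a Nash equilibrium that need not be the global optimum.

```python
import numpy as np

def alternating_best_response(payoff, start=(0, 0), max_rounds=100):
    """Agents take turns best-responding while the other agent is held fixed.

    payoff[a, b] is the shared reward when agent 1 plays a and agent 2 plays b.
    Returns the final joint action and the sequence of joint payoffs visited.
    """
    a, b = start
    history = [float(payoff[a, b])]
    for _ in range(max_rounds):
        changed = False
        # Agent 1 adapts; agent 2's action is frozen.
        best_a = int(np.argmax(payoff[:, b]))
        if payoff[best_a, b] > payoff[a, b]:
            a, changed = best_a, True
            history.append(float(payoff[a, b]))
        # Agent 2 adapts; agent 1's action is frozen.
        best_b = int(np.argmax(payoff[a, :]))
        if payoff[a, best_b] > payoff[a, b]:
            b, changed = best_b, True
            history.append(float(payoff[a, b]))
        if not changed:
            break  # no agent can improve unilaterally
    return (a, b), history

def is_nash(payoff, a, b):
    """True if neither agent can gain by deviating unilaterally."""
    return bool(payoff[a, b] >= payoff[:, b].max()
                and payoff[a, b] >= payoff[a, :].max())

# Hypothetical coordination game with three pure equilibria of different value.
payoff = np.array([[1.0, 0.0, 0.0],
                   [0.0, 2.0, 0.0],
                   [0.0, 0.0, 5.0]])
joint, history = alternating_best_response(payoff, start=(0, 1))
print(joint, history, is_nash(payoff, *joint))
```

Because only strict improvements are accepted and the game is finite, the loop must terminate. In this example it stops at a joint action that is a Nash equilibrium but not the best one, which matches the abstract's claim of convergence to a Nash equilibrium rather than to an optimal joint policy.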

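This paper proposes a trainable online decentralized planning algorithm based on decentralized Monte Carlo Tree Search, combined with teammate models learned from previous episodic runs. It uses deep learning and convolutional neural networks to produce accurate policy approximators, which improves planning performance. The algorithm targets cooperative multi-agent systems that plan online in a decentralized fashion.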

Decentralized MCTS via Learned Teammate Models

Consider the problem of approximating the optimal policy of a Markov decision
process (MDP) by sampling state transitions. In contrast to existing
reinforcement learning methods that are based on successive approximations to
the nonlinear Bellman equation, we propose a Primal-Dual $\pi$ Learning method
in light of the linear duality between the value and policy. The $\pi$ learning
method is model-free and makes primal-dual updates to the policy and value
vectors as new data are revealed. For an infinite-horizon undiscounted Markov
decision process with finite state space $S$ and finite action space $A$, the
$\pi$ learning method finds an $\epsilon$-optimal policy using the following
number of sample transitions:
$$ \tilde{O}\left( \frac{(\tau \cdot t^*_{mix})^2 |S| |A|}{\epsilon^2} \right), $$
where $t^*_{mix}$ is an upper bound of mixing times
across all policies and $\tau$ is a parameter characterizing the range of
stationary distributions across policies. The $\pi$ learning method also
applies to the computational problem of MDP where the transition probabilities
and rewards are explicitly given as the input. In the case where each state
transition can be sampled in $\tilde{O}(1)$ time, the $\pi$ learning method
gives a sublinear-time algorithm for solving the average-reward MDP.
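
To get a feel for how the sample bound above scales, the following sketch evaluates only its leading term, $(\tau \cdot t^*_{mix})^2 |S| |A| / \epsilon^2$. The constants and logarithmic factors hidden by $\tilde{O}$ are dropped, so the numbers are not actual sample counts; the function name `sample_bound` and all parameter values are hypothetical and chosen purely for illustration.

```python
def sample_bound(tau, t_mix, n_states, n_actions, epsilon):
    """Leading term of the bound: (tau * t_mix)^2 * |S| * |A| / epsilon^2.

    Constants and log factors hidden by the O-tilde notation are ignored.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return (tau * t_mix) ** 2 * n_states * n_actions / epsilon ** 2

# Hypothetical problem: 100 states, 5 actions, tau = 2, mixing-time bound 10.
base = sample_bound(tau=2.0, t_mix=10.0, n_states=100, n_actions=5, epsilon=0.1)
tighter = sample_bound(tau=2.0, t_mix=10.0, n_states=100, n_actions=5, epsilon=0.05)

# Halving epsilon multiplies the leading term by four (1 / epsilon^2 scaling).
ratio = tighter / base

# An explicitly given transition model has |S|^2 * |A| entries, whereas the
# leading term grows only with |S| * |A| once the other factors are fixed.
model_entries = 100 * 100 * 5
print(base, ratio, model_entries)
```

The comparison with `model_entries` is the intuition behind the sublinear-time remark: for fixed $\tau$, $t^*_{mix}$ and $\epsilon$, the leading term grows with $|S||A|$ while the explicit input grows with $|S|^2|A|$.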

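This paper proposes a Primal-Dual π Learning method that exploits the linear duality between value and policy, updating the value and policy vectors to approximate the optimal policy of an infinite-horizon undiscounted Markov decision process, and gives an upper bound on its sample complexity. The method also applies to the computational problem with finite state and action spaces in which the transition probabilities are given explicitly; when each transition can be sampled cheaply, it solves the average-reward problem in sublinear time.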

Primal-Dual $\pi$ Learning: Sample Complexity and Sublinear Run Time for Ergodic Markov Decision Problems

Standard value function approaches to finding policies for Partially
Observable Markov Decision Processes (POMDPs) are generally considered to be
intractable for large models. The intractability of these algorithms is to a
large extent a consequence of computing an exact, optimal policy over the
entire belief space. However, in real-world POMDP problems, computing the
optimal policy for the full belief space is often unnecessary for good control
even for problems with complicated policy classes. The beliefs experienced by
the controller often lie near a structured, low-dimensional subspace embedded
in the high-dimensional belief space. Finding a good approximation to the
optimal value function for only this subspace can be much easier than computing
the full value function. We introduce a new method for solving large-scale
POMDPs by reducing the dimensionality of the belief space. We use Exponential
family Principal Components Analysis (Collins, Dasgupta and Schapire, 2002) to
represent sparse, high-dimensional belief spaces using small sets of learned
features of the belief state. We then plan only in terms of the low-dimensional
belief features. By planning in this low-dimensional space, we can find
policies for POMDP models that are orders of magnitude larger than models that
can be handled by conventional techniques. We demonstrate the use of this
algorithm on a synthetic problem and on mobile robot navigation tasks.
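
The idea of representing sparse, high-dimensional beliefs with a few learned features can be sketched as follows. Note the substitution: this uses ordinary PCA (via SVD) as a simpler stand-in, not the Exponential family PCA named above, and the synthetic corridor beliefs, the clip-and-renormalize step, and all function names are assumptions made for illustration only.

```python
import numpy as np

def make_beliefs(n_beliefs=200, n_states=50, width=1.5, seed=0):
    """Synthetic beliefs: one narrow bump per belief over a 1-D corridor."""
    rng = np.random.default_rng(seed)
    states = np.arange(n_states)
    centers = rng.uniform(0, n_states - 1, size=n_beliefs)
    b = np.exp(-0.5 * ((states[None, :] - centers[:, None]) / width) ** 2)
    return b / b.sum(axis=1, keepdims=True)  # each row sums to 1

def fit_pca(beliefs, k):
    """Return the mean belief and the top-k principal directions."""
    mean = beliefs.mean(axis=0)
    _, _, vt = np.linalg.svd(beliefs - mean, full_matrices=False)
    return mean, vt[:k]

def compress(beliefs, mean, basis):
    """Project beliefs onto the k-dimensional feature space."""
    return (beliefs - mean) @ basis.T

def reconstruct(features, mean, basis):
    """Map features back to beliefs; clip and renormalize, because a linear
    reconstruction is not guaranteed to be a valid probability distribution."""
    b = np.clip(features @ basis + mean, 0.0, None)
    return b / b.sum(axis=1, keepdims=True)

beliefs = make_beliefs()
mean, basis = fit_pca(beliefs, k=10)
features = compress(beliefs, mean, basis)   # 50-dim beliefs -> 10 features
recon = reconstruct(features, mean, basis)
print(features.shape, float(np.abs(recon - beliefs).max()))
```

A planner would then operate on `features` instead of the full belief vectors. The need to clip negative entries in `reconstruct` is a limitation of this linear stand-in on probability vectors.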

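This work proposes an algorithm for solving large partially observable Markov decision processes (POMDPs) by reducing the dimensionality of the belief space and planning over the resulting low-dimensional features. It uses Exponential family Principal Components Analysis for the reduction, and the algorithm is demonstrated on a synthetic problem and on mobile robot navigation tasks.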