Multi-agent path finding (MAPF) is an essential component of many
large-scale, real-world robot deployments, from aerial swarms to warehouse
automation. However, despite the community's continued efforts, most
state-of-the-art MAPF planners still rely on centralized planning and scale
poorly past a few hundred agents. Such planning approaches are ill-suited to
real-world deployments, where noise and uncertainty often require that paths be
recomputed online, which is impossible when planning takes seconds to
minutes. We present PRIMAL, a novel framework for MAPF that combines
reinforcement and imitation learning to teach fully decentralized policies,
where agents reactively plan paths online in a partially observable world while
exhibiting implicit coordination. This framework extends our previous work on
distributed learning of collaborative policies by introducing demonstrations of
an expert MAPF planner during training, as well as careful reward shaping and
environment sampling. Once learned, the resulting policy can be copied onto any
number of agents and naturally scales to different team sizes and world
dimensions. We present results on randomized worlds with up to 1024 agents and
compare success rates against state-of-the-art MAPF planners. Finally, we
experimentally validate the learned policies in a hybrid simulation of a
factory mockup, involving both real-world and simulated robots.
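
The decentralized execution described above, where one policy is copied onto every agent and each agent acts on a partial observation, can be sketched as follows. This is only an illustration under stated assumptions: the grid encoding, the square field of view, and the names `local_view`, `placeholder_policy`, and `step_all` are hypothetical, and the greedy rule is a hand-written stand-in for the learned policy, not PRIMAL's network.

```python
from typing import List, Tuple

Grid = List[List[int]]   # assumed encoding: 1 = obstacle, 0 = free cell
Pos = Tuple[int, int]    # (row, col)


def local_view(grid: Grid, pos: Pos, radius: int) -> Grid:
    """Crop a (2*radius+1)-wide square window centered on pos.

    Cells outside the grid are reported as obstacles.
    """
    rows, cols = len(grid), len(grid[0])
    r0, c0 = pos
    return [
        [grid[r][c] if 0 <= r < rows and 0 <= c < cols else 1
         for c in range(c0 - radius, c0 + radius + 1)]
        for r in range(r0 - radius, r0 + radius + 1)
    ]


def placeholder_policy(view: Grid, goal_offset: Pos) -> Pos:
    """Hypothetical stand-in for a learned policy (requires radius >= 1).

    Takes one greedy step toward the goal if that neighboring cell is free
    in the local view, otherwise stays in place.
    """
    center = len(view) // 2
    d_row, d_col = goal_offset
    candidates = []
    if d_row != 0:
        candidates.append((1 if d_row > 0 else -1, 0))
    if d_col != 0:
        candidates.append((0, 1 if d_col > 0 else -1))
    for m_row, m_col in candidates:
        if view[center + m_row][center + m_col] == 0:
            return (m_row, m_col)
    return (0, 0)


def step_all(grid: Grid, positions: List[Pos], goals: List[Pos],
             radius: int = 1) -> List[Pos]:
    """Run the same policy independently for every agent."""
    moves = []
    for pos, goal in zip(positions, goals):
        view = local_view(grid, pos, radius)
        offset = (goal[0] - pos[0], goal[1] - pos[1])
        moves.append(placeholder_policy(view, offset))
    return moves
```

Because `step_all` only loops over agents and never builds a joint plan, the same function works for any team size. The sketch ignores agent-agent collisions, which the learned policies are described as handling through implicit coordination.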

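This work presents PRIMAL, a new multi-agent path finding framework that combines reinforcement learning and imitation learning to train fully decentralized policies, which reactively plan paths online in a partially observable environment and exhibit implicit coordination. The framework extends the authors' previous work on distributed learning of collaborative policies by introducing demonstrations from an expert planner, careful reward shaping, and environment sampling. Finally, the learned policies are validated in a hybrid simulation involving both real-world and simulated robots.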

PRIMAL: Pathfinding via Reinforcement and Imitation Multi-Agent Learning

Patriksson (2008) provided a then up-to-date survey on the
continuous, separable, differentiable and convex resource allocation problem
with a single resource constraint. Since the publication of that paper the
interest in the problem has grown: several new applications have arisen where
the problem at hand constitutes a subproblem, and several new algorithms have
been developed for its efficient solution. This paper therefore serves three
purposes. First, it provides an up-to-date extension of the survey of the
literature of the field, complementing the survey in Patriksson (2008) with
more than 20 books and articles. Second, it contributes improvements of some of
these algorithms, in particular with an improvement of the pegging (that is,
variable fixing) process in the relaxation algorithm, and an improved means to
evaluate subsolutions. Third, it numerically evaluates several relaxation
(primal) and breakpoint (dual) algorithms, incorporating a variety of pegging
strategies, as well as a quasi-Newton method. Our conclusion is that our
modification of the relaxation algorithm performs the best. At least for
problem sizes up to 30 million variables the practical time complexity for the
breakpoint and relaxation algorithms is linear.
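
To make the pegging idea concrete, here is a minimal sketch of a textbook relaxation scheme with variable fixing. It is not the improved variant proposed in the paper, and it rests on assumptions not in the abstract: a quadratic objective with terms 0.5*w_j*x_j^2 - c_j*x_j (with w_j > 0), a resource constraint sum_j x_j = b, and box bounds. The function name `solve_quadratic_allocation` is hypothetical.

```python
def solve_quadratic_allocation(w, c, lower, upper, b):
    """Minimize sum_j (0.5*w[j]*x[j]**2 - c[j]*x[j])
    subject to sum_j x[j] == b and lower[j] <= x[j] <= upper[j].

    Assumes w[j] > 0 for all j (strict convexity).
    """
    n = len(w)
    if not (sum(lower) <= b <= sum(upper)):
        raise ValueError("problem is infeasible")

    x = [0.0] * n
    free = set(range(n))      # indices not yet pegged to a bound
    remaining = float(b)      # resource left for the free variables

    while free:
        # Relaxed subproblem: ignore the bounds on the free variables.
        # Stationarity w_j*x_j - c_j + mu = 0 gives x_j = (c_j - mu) / w_j,
        # and mu follows from the resource constraint.
        inv_sum = sum(1.0 / w[j] for j in free)
        mu = (sum(c[j] / w[j] for j in free) - remaining) / inv_sum
        trial = {j: (c[j] - mu) / w[j] for j in free}

        over = [j for j in free if trial[j] > upper[j]]
        under = [j for j in free if trial[j] < lower[j]]
        if not over and not under:
            # Relaxed solution respects all bounds: it is optimal.
            for j in free:
                x[j] = trial[j]
            break

        # Total bound violation on each side decides which side to peg.
        excess = sum(trial[j] - upper[j] for j in over)
        deficit = sum(lower[j] - trial[j] for j in under)
        if excess >= deficit:
            for j in over:
                x[j] = upper[j]
                remaining -= upper[j]
                free.discard(j)
        if deficit >= excess:
            for j in under:
                x[j] = lower[j]
                remaining -= lower[j]
                free.discard(j)

    return x
```

For example, with `w = [1, 1, 1]`, `c = [3, 0, 0]`, bounds `[0, 1]` on every variable and `b = 1.5`, the first relaxed solution violates the upper bound on the first variable by more than the other two violate their lower bounds, so the first variable is pegged at 1 and the rest of the resource is split evenly. Each pass pegs at least one variable, so the loop runs at most n times.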

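This paper provides an up-to-date review and extension of the literature on the continuous, separable, differentiable and convex resource allocation problem with a single resource constraint, covering more than 20 books and articles. It also improves some of the algorithms and numerically evaluates several relaxation (primal) and breakpoint (dual) algorithms, concluding that the modified relaxation algorithm performs best.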