Multi-agent path finding (MAPF) is the problem of finding collision-free
paths for a team of agents to reach their goal locations. State-of-the-art
classical MAPF solvers typically employ heuristic search to find solutions for
hundreds of agents, but they are usually centralized and can struggle to scale
when run with short timeouts. Machine learning (ML) approaches that learn policies
for each agent are appealing as these could enable decentralized systems and
scale well while maintaining good solution quality. Current ML approaches to
MAPF have only begun to scratch the surface of this potential. However,
state-of-the-art ML approaches produce "local" policies that plan for only a
single timestep and have poor success rates and scalability. Our main idea is
that we can improve an ML local policy by using
heuristic search methods on the output probability distribution to resolve
deadlocks and enable full horizon planning. We show several model-agnostic ways
to use heuristic search with learnt policies that significantly improve the
policies' success rates and scalability. To the best of our knowledge, this is
the first demonstration of ML-based MAPF approaches scaling to high-congestion
scenarios (e.g., 20% agent density).
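
To make the idea of searching over a policy's output distribution concrete, here is a minimal sketch for a single timestep. It is not the paper's method: the abstract does not specify the algorithms, so the function name `resolve_step`, the five-action layout in `MOVES`, the fixed agent ordering, and the use of plain depth-first backtracking are all assumptions chosen for illustration. Each agent tries its actions from most to least probable, and the search backtracks whenever a choice would leave a later agent with no collision-free move.

```python
from typing import List, Set, Tuple

Cell = Tuple[int, int]

# Hypothetical action layout: wait, up, down, left, right as (row, col) offsets.
MOVES: List[Cell] = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]


def resolve_step(
    positions: List[Cell],
    probs: List[List[float]],
    free: Set[Cell],
) -> List[Cell]:
    """Return one collision-free next cell per agent.

    positions[i] is agent i's current cell, probs[i][a] is the (assumed)
    policy probability of action a for agent i, and free is the set of
    traversable cells. Agents are handled in index order, which is an
    arbitrary fixed priority used only for this sketch.
    """
    n = len(positions)
    chosen: List[Cell] = []  # next cells of agents 0..i-1 during the search

    def valid(i: int, nxt: Cell) -> bool:
        if nxt not in free:
            return False
        for j, other in enumerate(chosen):
            if other == nxt:  # vertex conflict: two agents in one cell
                return False
            if other == positions[i] and positions[j] == nxt:  # swap conflict
                return False
        return True

    def dfs(i: int) -> bool:
        if i == n:
            return True
        # Try actions from most to least probable under the policy.
        order = sorted(range(len(MOVES)), key=lambda a: -probs[i][a])
        for a in order:
            dr, dc = MOVES[a]
            nxt = (positions[i][0] + dr, positions[i][1] + dc)
            if valid(i, nxt):
                chosen.append(nxt)
                if dfs(i + 1):
                    return True
                chosen.pop()  # dead end for a later agent: backtrack
        return False

    if not dfs(0):
        raise ValueError("no collision-free joint action found")
    return list(chosen)


# Two agents facing each other in a three-cell corridor.
corridor = {(0, 0), (0, 1), (0, 2)}
policy = [[0.1, 0.0, 0.0, 0.0, 0.9],     # agent 0 prefers moving right
          [0.06, 0.0, 0.0, 0.9, 0.04]]   # agent 1 prefers moving left
print(resolve_step([(0, 0), (0, 1)], policy, corridor))  # [(0, 1), (0, 2)]
```

In the example, agent 1's preferred move would swap cells with agent 0, so the search falls back to its next feasible action. The worst-case cost of this backtracking grows exponentially with the number of agents, so it is only meant to show the kind of conflict that a search layer over a learnt policy has to resolve.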

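Multi-agent path finding is the problem of finding collision-free paths for a group of agents to reach their goal locations. State-of-the-art classical MAPF solvers usually use heuristic search to find solutions for hundreds of agents, but they are usually centralized and can struggle to scale under short time limits. Machine learning approaches that learn a policy for each agent are appealing because they could enable decentralized systems and scale well while maintaining good solution quality. Our main idea is to improve an ML local policy by applying heuristic search, resolving deadlocks and enabling full-horizon planning. We show several model-agnostic ways of combining heuristic search with learnt policies that significantly improve the policies' success rates and scalability. To the best of our knowledge, this is the first demonstration of ML-based MAPF approaches scaling to high-congestion scenarios (e.g., 20% agent density).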

Improving Learnt Local MAPF Policies with Heuristic Search

Optimal control deals with optimization problems in which variables steer a
dynamical system, and its outcome contributes to the objective function. Two
classical approaches to solving these problems are Dynamic Programming and the
Pontryagin Maximum Principle. In both approaches, Hamiltonian equations offer
an interpretation of optimality through auxiliary variables known as costates.
However, Hamiltonian equations are rarely used due to their reliance on
forward-backward algorithms across the entire temporal domain. This paper
introduces a novel neural-based approach to optimal control that aims to work
forward in time. Neural networks are employed not only for implementing
state dynamics but also for estimating costate variables. The parameters of the
latter network are determined at each time step using a newly introduced local
policy referred to as the time-reversed generalized Riccati equation. This
policy is inspired by a result from the Linear Quadratic (LQ) problem, and we
conjecture that it stabilizes the state dynamics. We support this conjecture by
discussing experimental results from a range of optimal control case studies.
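
For orientation, the textbook finite-horizon LQ problem shows where costates and the Riccati equation come from. The notation below ($A$, $B$, $Q$, $R$, $Q_f$, $P$, horizon $T$) is the standard one and is an assumption here, since the abstract fixes no symbols. This is the classical Riccati equation, not the time-reversed generalized version the paper introduces.

```latex
\begin{aligned}
\dot{x}(t) &= A x(t) + B u(t), \qquad
J = \tfrac{1}{2}\, x(T)^\top Q_f\, x(T)
  + \tfrac{1}{2} \int_0^T \left( x^\top Q x + u^\top R u \right) dt, \\
p(t) &= P(t)\, x(t), \qquad
u^*(t) = -R^{-1} B^\top P(t)\, x(t), \\
-\dot{P}(t) &= A^\top P(t) + P(t) A - P(t) B R^{-1} B^\top P(t) + Q,
\qquad P(T) = Q_f .
\end{aligned}
```

Because the condition on $P$ is given at the final time $T$, the Riccati equation is integrated backward while the state $x$ evolves forward. This is the forward-backward structure over the whole time horizon that the abstract describes as the obstacle to using Hamiltonian equations directly.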

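This paper introduces a novel neural-network-based approach to optimal control that aims to work forward in time. Neural networks are used not only to implement the state dynamics but also to estimate the costate variables. The parameters of the latter network are determined by a newly introduced local policy, the time-reversed generalized Riccati equation. We conjecture that this policy stabilizes the state dynamics, and we support the conjecture with experimental results from a range of optimal control case studies.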