The min-max vehicle routing problem (min-max VRP) requires visiting all given
customers with several routes while minimizing the length of the
longest route. Recently, reinforcement learning (RL)-based sequential planning
methods have exhibited advantages in solving efficiency and optimality.
However, these methods fail to exploit the problem-specific properties in
learning representations, resulting in less effective features for decoding
optimal routes. This paper considers the sequential planning process of min-max
VRPs as two coupled optimization tasks: customer partition for different routes
and customer navigation in each route (i.e., partition and navigation). To
effectively process min-max VRP instances, we present a novel attention-based
Partition-and-Navigation encoder (P&N Encoder) that learns distinct embeddings
for partition and navigation. Furthermore, we utilize an inherent symmetry in
decoding routes and develop an effective agent-permutation-symmetric (APS) loss
function. Experimental results demonstrate that the proposed
Decoupling-Partition-Navigation (DPN) method significantly surpasses existing
learning-based methods in both single-depot and multi-depot min-max VRPs. Our
code is available at
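The min-max objective and the agent-permutation symmetry mentioned above can be illustrated with a minimal Python sketch. It is not the DPN method or its APS loss; the toy instance, the function names `route_length` and `min_max_objective`, and the reading of the symmetry as "relabeling which agent drives which route leaves the cost unchanged" are assumptions made for illustration only.

```python
import math
from itertools import permutations


def route_length(depot, route, coords):
    """Length of a closed tour: depot -> customers in the given order -> depot."""
    points = [depot] + [coords[c] for c in route] + [depot]
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def min_max_objective(depot, routes, coords):
    """Min-max VRP cost: the length of the longest route."""
    return max(route_length(depot, r, coords) for r in routes)


# Hypothetical toy instance (not taken from the paper).
depot = (0.0, 0.0)
coords = {0: (1.0, 0.0), 1: (2.0, 0.0), 2: (0.0, 1.0), 3: (0.0, 3.0)}

# Partition: which customers go to which route.
# Navigation: the visiting order inside each route.
routes = [[0, 1], [2, 3]]

cost = min_max_objective(depot, routes, coords)

# Reordering the routes (i.e. relabeling the agents) does not change the cost.
permuted_costs = {
    min_max_objective(depot, list(p), coords) for p in permutations(routes)
}
```

Here the first route has length 4 and the second has length 6, so `cost` is 6; every ordering of the two routes yields the same value, which is the kind of symmetry a permutation-symmetric loss could exploit.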

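By proposing a novel attention-based encoder and an effective Decoupling-Partition-Navigation (DPN) method, this paper significantly outperforms existing learning-based methods on single-depot and multi-depot min-max vehicle routing problems.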

DPN: A Neural Solver That Decouples Partition and Navigation for Min-Max Vehicle Routing Problems

DPN: Decoupling Partition and Navigation for Neural Solvers of Min-max Vehicle Routing Problems

Sequential planning in large state and action spaces quickly becomes
intractable due to combinatorial explosion of the search space. Heuristic
methods such as Monte Carlo tree search, though effective for large state
spaces, struggle when the action space is large. Pure reinforcement learning
methods, relying only on reward signals, need prohibitively many interactions
with the environment to devise a viable plan. If the state space, observations,
and actions can be represented in natural language, then large language models
(LLMs) can be used to generate action plans. Recently, several such goal-directed
agents, such as Reflexion, CLIN, and SayCan, were able to surpass the performance
of other state-of-the-art methods with minimal or no task-specific training. But
they still struggle with exploration and get stuck in local optima. Their
planning capabilities are constrained by the limited reasoning capability of
the foundational LLMs on text data. We propose a hybrid agent, "neoplanner",
that combines state space search with queries to a foundational LLM to obtain
the best action plan. The reward signals are used quantitatively to drive the
search. A balance of exploration and exploitation is maintained by maximizing
upper confidence bounds of values of states. In places where random exploration
is needed, the LLM is queried to generate an action plan. Learnings from each
trial are stored as entity relationships in text format. Those are used in
future queries to the LLM for continual improvement. Experiments in the
Scienceworld environment reveal a 124% improvement over the current best
method in terms of average reward gained across multiple tasks.
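The exploration and exploitation balance described above can be sketched with the standard UCB1 rule. The abstract only says that upper confidence bounds of state values are maximized, so the exact formula, the exploration constant `c`, and the dictionary layout below are assumptions; the LLM query step is not modeled here.

```python
import math


def ucb_score(mean_value, visits, parent_visits, c=1.4):
    """UCB1 score: average value plus an exploration bonus.

    Unvisited states get an infinite score so they are tried first.
    """
    if visits == 0:
        return math.inf
    return mean_value + c * math.sqrt(math.log(parent_visits) / visits)


def select_state(children, c=1.4):
    """Return the name of the child state with the highest UCB1 score.

    `children` maps a state name to {"value": mean reward, "visits": count}.
    """
    parent_visits = sum(ch["visits"] for ch in children.values())
    return max(
        children,
        key=lambda name: ucb_score(
            children[name]["value"], children[name]["visits"], parent_visits, c
        ),
    )


# Hypothetical statistics gathered from earlier trials.
children = {
    "open door": {"value": 0.5, "visits": 10},
    "read note": {"value": 0.4, "visits": 2},
}
chosen = select_state(children)
```

With these numbers, "read note" is selected even though its average value is lower, because it has been visited far less often and so carries a larger exploration bonus.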

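By combining state space search with queries to a language model, we propose a hybrid agent, neoplanner, which balances exploration and exploitation by maximizing upper confidence bounds on state values and queries the language model to generate action plans, further improving sequential planning performance in large state and action spaces.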

Sequential Planning in Large Partially Observable Environments Guided by LLMs

Sequential Planning in Large Partially Observable Environments guided by LLMs

Standard planners for sequential decision making (including Monte Carlo
planning, tree search, dynamic programming, etc.) are constrained by an
implicit sequential planning assumption: the order in which a plan is
constructed is the same as the order in which it is executed. We consider alternatives to
this assumption for the class of goal-directed Reinforcement Learning (RL)
problems. Instead of an environment transition model, we assume an imperfect,
goal-directed policy. This low-level policy can be improved by a plan,
consisting of an appropriate sequence of sub-goals that guide it from the start
to the goal state. We propose a planning algorithm, Divide-and-Conquer Monte
Carlo Tree Search (DC-MCTS), for approximating the optimal plan by means of
proposing intermediate sub-goals which hierarchically partition the initial
tasks into simpler ones that are then solved independently and recursively. The
algorithm critically makes use of a learned sub-goal proposal for finding
appropriate partition trees of new tasks based on prior experience. Different
strategies for learning sub-goal proposals give rise to different planning
strategies that strictly generalize sequential planning. We show that this
algorithmic flexibility over planning order leads to improved results in
navigation tasks in grid-worlds as well as in challenging continuous control
environments.
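The idea of recursively splitting a task with intermediate sub-goals can be sketched as plain divide-and-conquer recursion. This is not DC-MCTS: there is no tree search and no learned proposal. The 1D world, the `reach` limit standing in for an imperfect low-level policy, and the midpoint rule in `propose_subgoal` are all hypothetical stand-ins chosen for illustration.

```python
def can_reach(start, goal, reach=2):
    """Stand-in for the low-level policy: it succeeds only over short distances."""
    return abs(goal - start) <= reach


def propose_subgoal(start, goal):
    """Stand-in for a learned sub-goal proposal: pick the midpoint."""
    return (start + goal) // 2


def plan(start, goal, depth=8):
    """Return a list of sub-goals ending at `goal`, or None if no plan is found.

    The task (start -> goal) is split at a proposed sub-goal, and the two
    halves are solved independently and recursively.
    """
    if can_reach(start, goal):
        return [goal]
    if depth == 0:
        return None
    mid = propose_subgoal(start, goal)
    left = plan(start, mid, depth - 1)
    right = plan(mid, goal, depth - 1)
    if left is None or right is None:
        return None
    return left + right


subgoals = plan(0, 10)
```

Note that the plan is not built in execution order: the sub-goal at position 5 is chosen first, and the segments on either side of it are filled in afterwards.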

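This paper proposes a planning algorithm called DC-MCTS for goal-directed reinforcement learning problems. The algorithm proposes intermediate sub-goals that hierarchically partition the initial task and solves the resulting simpler tasks independently and recursively, thereby improving the policy and allowing flexibility in planning order; it achieves strong performance in grid-worlds and various continuous control environments.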