Intrigued by the claims of emergent reasoning capabilities in LLMs trained on
general web corpora, in this paper, we set out to investigate their planning
capabilities. We aim to evaluate (1) the effectiveness of LLMs in generating
plans autonomously in commonsense planning tasks and (2) the potential of LLMs
as a source of heuristic guidance for other agents (AI planners) in their
planning tasks. We conduct a systematic study by generating a suite of
instances on domains similar to the ones employed in the International Planning
Competition and evaluate LLMs in two distinct modes: autonomous and heuristic.
Our findings reveal that LLMs' ability to generate executable plans
autonomously is rather limited, with the best model (GPT-4) having an average
success rate of ~12% across the domains. However, the results in the heuristic
mode show more promise. In the heuristic mode, we demonstrate that
LLM-generated plans can improve the search process for underlying sound
planners and additionally show that external verifiers can help provide
feedback on the generated plans and back-prompt the LLM for better plan
generation.

本文旨在研究 LLLms 在常识规划任务中的规划能力，通过在国际计划竞赛中生成一系列实例，并评估 LLMs 在自主规划和启发式两种不同模式下的表现，发现 LLMs 在自主规划方面的表现非常有限，但在启发式模式下，LLMs 生成的计划可以改善其它智能计划器的搜索过程并提供反馈以进一步验证计划质量。

大型语言模型的规划能力 - 一项关键调查

On the Planning Abilities of Large Language Models -- A Critical  Investigation

Intrigued by the claims of emergent reasoning capabilities in LLMs trained on
general web corpora, in this paper, we set out to investigate their planning
capabilities. We aim to evaluate (1) how good LLMs are by themselves in
generating and validating simple plans in commonsense planning tasks (of the
type that humans are generally quite good at) and (2) how good LLMs are in
being a source of heuristic guidance for other agents--either AI planners or
human planners--in their planning tasks. To investigate these questions in a
systematic rather than anecdotal manner, we start by developing a benchmark
suite based on the kinds of domains employed in the International Planning
Competition. On this benchmark, we evaluate LLMs in three modes: autonomous,
heuristic and human-in-the-loop. Our results show that LLM's ability to
autonomously generate executable plans is quite meager, averaging only about 3%
success rate. The heuristic and human-in-the-loop modes show slightly more
promise. In addition to these results, we also make our benchmark and
evaluation tools available to support investigations by research community.

研究了通用 Web 语料库上训练的语言模型的计划能力，开发了基于国际计划竞赛领域的基准套件，在自治、启发式和人机协作模式下对 LLM 进行了评估，发现自主生成可执行计划的能力非常有限，只有约 3% 的成功率。