Large Language Models (LLMs) have shown impressive capabilities across a wide variety of tasks, yet they still struggle with long-horizon planning. To study this, we propose path planning tasks as a platform for evaluating LLMs' ability to navigate long trajectories under geometric constraints. Our benchmark systematically tests path-planning skills in complex settings. With it, we examine GPT-4's planning abilities under various task representations and prompting approaches. We find that framing prompts as Python code and decomposing long-trajectory tasks improve GPT-4's path-planning effectiveness. However, while these approaches show some promise for improving the model's planning ability, they do not yield optimal paths and fail to generalize over extended horizons.

Look Further Ahead: Testing the Limits of GPT-4 in Path Planning