Recent advancements in Large Language Models (LLMs) have showcased their ability to perform complex reasoning tasks, but their effectiveness in planning remains underexplored. In this study, we evaluate the planning capabilities of OpenAI's o1 models across a variety of benchmark tasks, focusing on three key aspects: feasibility, optimality, and generalizability. Through empirical evaluations on constraint-heavy tasks (e.g., $\textit{Barman}$, $\textit{Tyreworld}$) and spatially complex environments (e.g., $\textit{Termes}$, $\textit{Floortile}$), we highlight o1-preview's strengths in self-evaluation and constraint-following, while also identifying bottlenecks in decision-making and memory management, particularly in tasks requiring robust spatial reasoning. Our results reveal that o1-preview outperforms GPT-4 in adhering to task constraints and managing state transitions in structured environments. However, the model often generates suboptimal solutions with redundant actions and struggles to generalize effectively in spatially complex tasks. This pilot study provides foundational insights into the planning limitations of LLMs, offering key directions for future research on improving memory management, decision-making, and generalization in LLM-based planning.

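This study addresses the shortcomings of large language models in planning by evaluating OpenAI's o1 models on several benchmark tasks, focusing on feasibility, optimality, and generalizability. It finds that although o1-preview outperforms GPT-4 at following task constraints, bottlenecks remain in its generalization on spatially complex tasks and in its decision-making and memory management. These findings point to important directions for improving the planning abilities of language models.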

On the Planning Abilities of OpenAI's o1 Models: Feasibility, Optimality, and Generalizability