The ability to plan a course of action that achieves a desired state of affairs has long been considered a core competence of intelligent agents and has been an integral part of AI research since its inception. With the advent of large language models (LLMs), there has been considerable interest in the question of whether or not they possess such planning abilities, but -- despite the slew of new private and open-source LLMs since GPT-3 -- progress has remained slow. OpenAI claims that their recent o1 (Strawberry) model has been specifically constructed and trained to escape the normal limitations of autoregressive LLMs -- making it a new kind of model: a Large Reasoning Model (LRM). In this paper, we evaluate the planning capabilities of two LRMs (o1-preview and o1-mini) on both planning and scheduling benchmarks. We see that although o1 does seem to offer significant improvements over autoregressive LLMs, this comes at a steep inference cost, and the model still fails to provide any guarantees over what it generates. We also show that combining o1 models with external verifiers -- in a so-called LRM-Modulo system -- guarantees the correctness of the combined system's output while further improving performance.
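The idea of pairing a model with an external verifier can be illustrated with a minimal generate-and-test sketch. Everything below is an assumption made for illustration and is not taken from the paper: the function names (`modulo_loop`, `make_counter_verifier`, `make_scripted_generator`), the toy counter domain, and the scripted stand-in for the model. The point it shows is that a candidate is returned only after a sound verifier accepts it, so any output of the combined system is correct by construction, even though the generator alone offers no such guarantee.

```python
from typing import Callable, List, Optional, Tuple

Plan = List[str]
# A verifier returns (is_valid, feedback). Hypothetical interface.
Verifier = Callable[[Plan], Tuple[bool, str]]
# A generator maps a prompt to a candidate plan. In a real system this
# would wrap a model call; here it is a stand-in.
Generator = Callable[[str], Plan]


def modulo_loop(generate: Generator, verify: Verifier, task: str,
                max_rounds: int = 5) -> Optional[Plan]:
    """Return the first candidate the verifier accepts, or None."""
    prompt = task
    for _ in range(max_rounds):
        candidate = generate(prompt)
        ok, feedback = verify(candidate)
        if ok:
            return candidate  # only verified plans ever leave the loop
        # Feed the critique back so the next attempt can correct it.
        prompt = (f"{task}\nPrevious attempt: {candidate}\n"
                  f"Verifier feedback: {feedback}")
    return None  # budget exhausted: no answer rather than a wrong one


def make_counter_verifier(start: int, goal: int) -> Verifier:
    """Toy domain (assumed): reach `goal` from `start` via inc/double."""
    def verify(plan: Plan) -> Tuple[bool, str]:
        value = start
        for step, action in enumerate(plan):
            if action == "inc":
                value += 1
            elif action == "double":
                value *= 2
            else:
                return False, f"step {step}: unknown action {action!r}"
        if value != goal:
            return False, f"plan ends at {value}, goal is {goal}"
        return True, "valid"
    return verify


def make_scripted_generator(candidates: List[Plan]) -> Generator:
    """Stand-in for a model: replays fixed candidates, ignoring the prompt."""
    remaining = iter(candidates)
    def generate(prompt: str) -> Plan:
        return next(remaining)
    return generate


verify = make_counter_verifier(start=1, goal=6)
generate = make_scripted_generator([
    ["double", "double"],         # wrong: ends at 4
    ["double", "inc", "double"],  # right: 1 -> 2 -> 3 -> 6
])
plan = modulo_loop(generate, verify, "Reach 6 from 1 using inc/double.")
print(plan)
```

In this sketch the first candidate is rejected and the second is accepted. The guarantee depends entirely on the verifier being sound for the domain, and the extra rounds add to the inference cost noted above.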

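This study addresses the planning limitations of large language models (LLMs) by evaluating o1, a new Large Reasoning Model (LRM). The results show that o1 outperforms conventional autoregressive LLMs on planning and scheduling benchmarks, but at a high cost and without any guarantee that its output is correct. Combining o1 models with external verifiers to build an LRM-Modulo system improves performance while ensuring that the output is correct.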

Planning in Strawberry Fields: Evaluating and Improving the Planning and Scheduling Capabilities of LRM o1