Recently, there has been growing interest in leveraging large language models (LLMs) to generate symbolic world models from textual descriptions. Although LLMs have been extensively explored in the context of world modeling, prior studies have encountered several challenges, including evaluation randomness, dependence on indirect metrics, and a limited domain scope. To address these limitations, we introduce a novel benchmark, Text2World, based on the Planning Domain Definition Language (PDDL), featuring hundreds of diverse domains and employing multi-criteria, execution-based metrics for a more robust evaluation. We benchmark current LLMs using Text2World and find that reasoning models trained with large-scale reinforcement learning outperform others. However, even the best-performing model still demonstrates limited capabilities in world modeling. Building on these insights, we examine several promising strategies to enhance the world modeling capabilities of LLMs, including test-time scaling, agent training, and more. We hope that Text2World can serve as a crucial resource, laying the groundwork for future research in leveraging LLMs as world models. The project page is available at https://text-to-world.github.io/.

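This work addresses the evaluation randomness, reliance on indirect metrics, and limited domain scope that affect prior studies of large language models (LLMs) generating symbolic world models from textual descriptions. We introduce a new benchmark, Text2World, which uses multi-criteria, execution-based evaluation, and find that reasoning models trained with large-scale reinforcement learning outperform other models, although even the best model remains limited in world modeling. We explore several strategies, including test-time scaling and agent training, to improve the world modeling capabilities of LLMs.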

Text2World: Benchmarking Large Language Models for Symbolic World Model Generation