Multimodal Large Language Models (MLLMs), which build upon powerful Large Language Models (LLMs) with exceptional reasoning and generalization capabilities, have opened up new avenues for embodied task planning. MLLMs excel at integrating diverse environmental inputs, such as real-time task progress, visual observations, and open-form language instructions, which are crucial for executable task planning. In this work, we introduce EgoPlan-Bench, a human-annotated benchmark, to quantitatively investigate the potential of MLLMs as embodied task planners in real-world scenarios. Our benchmark is distinguished by realistic tasks derived from real-world videos, a diverse set of actions involving interactions with hundreds of different objects, and complex visual observations from varied environments. We evaluate various open-source MLLMs as well as GPT-4V, revealing that none of these models has yet evolved into an embodied planning generalist. We further construct an instruction-tuning dataset, EgoPlan-IT, from videos of human-object interactions to facilitate the learning of high-level task planning in intricate real-world situations. The experimental results demonstrate that the model tuned on EgoPlan-IT not only significantly improves performance on our benchmark but also acts effectively as an embodied planner in simulations.

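Multimodal Large Language Models (MLLMs), built on Large Language Models (LLMs) with strong reasoning and generalization capabilities, have opened up new avenues for embodied task planning. We introduce EgoPlan-Bench, a human-annotated benchmark, to quantitatively investigate the potential of MLLMs as embodied task planners in real-world scenarios, and we construct an instruction-tuning dataset, EgoPlan-IT. The experimental results show that the model tuned on EgoPlan-IT not only significantly improves performance on our benchmark but also acts effectively as an embodied planner in simulations.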

EgoPlan-Bench: Benchmarking Egocentric Embodied Planning with Multimodal Large Language Models