Large Language Models (LLMs) and Visual Language Models (VLMs) are attracting increasing interest due to their improving performance and applications across various domains and tasks. However, LLMs and VLMs can produce erroneous results, especially when a deep understanding of the problem domain is required. For instance, when planning and perception are needed simultaneously, these models often struggle because of difficulties in merging multi-modal information. To address this issue, fine-tuned models are typically employed and trained on specialized data structures representing the environment. This approach has limited effectiveness, as it can overly complicate the context for processing. In this paper, we propose a multi-agent architecture for embodied task planning that operates without the need for specific data structures as input. Instead, it uses a single image of the environment, handling free-form domains by leveraging commonsense knowledge. We also introduce a novel, fully automatic evaluation procedure, PG2S, designed to better assess the quality of a plan. We validated our approach using the widely recognized ALFRED dataset, comparing PG2S to the existing KAS metric to further evaluate the quality of the generated plans.

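This study addresses the poor performance of large language models and visual language models on tasks that require planning and perception at the same time, which stems from the difficulty of fusing multi-modal information. It proposes a multi-agent architecture that takes a single image of the environment as input and uses commonsense knowledge to handle free-form domains, and it introduces PG2S, a new fully automatic evaluation procedure for better assessing plan quality. The approach is validated on the ALFRED dataset, where PG2S is compared with the existing KAS metric.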

Multi-agent Planning Using Visual Language Models