While large language models (LLMs) excel in a simulated world of texts, they
struggle to interact with the more realistic world without perceptions of other
modalities such as visual or audio signals. Although vision-language models
(VLMs) integrate LLM modules that (1) are aligned with static image features and
(2) may possess prior knowledge of world dynamics (as demonstrated in the text
world), they have not been trained in an embodied visual world and thus cannot
align with its dynamics. On the other hand, training an embodied agent in a
noisy visual world without expert guidance is often challenging and
inefficient. In this paper, we train a VLM agent living in a visual world using
an LLM agent excelling in a parallel text world (but inapplicable to the visual
world). Specifically, we distill the LLM's reflection outcomes (actions improved
by analyzing mistakes) in a text world's tasks to finetune the VLM on the same
tasks in the visual world, resulting in an Embodied Multi-Modal Agent (EMMA)
that quickly adapts to the visual world's dynamics. Such cross-modality imitation
learning between the two parallel worlds enables EMMA to generalize to a broad
scope of new tasks without any further guidance from the LLM expert. Extensive
evaluations on the ALFWorld benchmark highlight EMMA's superior performance to
SOTA VLM-based agents across diverse tasks, e.g., 20%-70% improvement in the
success rate.

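By incorporating the reflection outcomes of large language models (LLMs) (actions improved by analyzing mistakes) from tasks in a text world into the same tasks of the visual world, we train a vision-language model (VLM) agent living in the visual world, so that this embodied multi-modal agent (EMMA) quickly adapts to the visual world's dynamics and achieves superior performance on the ALFWorld benchmark.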
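
The cross-modality imitation idea can be illustrated with a small toy sketch. Everything below is an assumption made for illustration and is not the paper's implementation: the one-dimensional world, the rule-based `expert_action` standing in for the LLM's reflected action, the `TabularStudent` standing in for the VLM, and the DAgger-style data aggregation loop are all hypothetical. The only point it shows is that the student acts on visual observations while its training labels come from an expert reading the parallel text description of the same state.

```python
from collections import Counter, defaultdict

# Hypothetical toy task: walk along cells 0..GOAL and reach GOAL.
GOAL = 4


def text_obs(state):
    """Text-world view of a state (what the expert reads)."""
    return f"You are at cell {state}. The goal is at cell {GOAL}."


def visual_obs(state):
    """Visual-world view of the same state (a one-hot stand-in for pixels)."""
    return tuple(1 if i == state else 0 for i in range(GOAL + 1))


def step(state, action):
    """Dynamics shared by the two parallel worlds."""
    move = 1 if action == "right" else -1
    return min(max(state + move, 0), GOAL)


def expert_action(text):
    """Rule-based stand-in for the LLM expert's corrected action."""
    position = int(text.split("cell ")[1].split(".")[0])
    return "right" if position < GOAL else "left"


class TabularStudent:
    """Stand-in for the VLM: majority vote over labels seen per observation."""

    def __init__(self):
        self.votes = defaultdict(Counter)

    def act(self, obs):
        if obs not in self.votes:
            return "left"  # untrained default, deliberately unhelpful
        return self.votes[obs].most_common(1)[0][0]

    def fit(self, dataset):
        self.votes.clear()
        for obs, action in dataset:
            self.votes[obs][action] += 1


def rollout(student, max_steps=8):
    """Student acts in the visual world; expert labels the parallel text state."""
    state, pairs = 0, []
    for _ in range(max_steps):
        if state == GOAL:
            break
        pairs.append((visual_obs(state), expert_action(text_obs(state))))
        state = step(state, student.act(visual_obs(state)))
    return pairs, state == GOAL


def train(rounds=6):
    """Aggregate expert-labelled pairs across rounds and refit the student."""
    student, dataset = TabularStudent(), []
    for _ in range(rounds):
        pairs, _ = rollout(student)
        dataset.extend(pairs)
        student.fit(dataset)
    return student


if __name__ == "__main__":
    _, solved = rollout(train())
    print("student reaches goal without expert:", solved)
```

Once trained, the student is rolled out using only `visual_obs`, which mirrors the claim that no further expert guidance is needed at test time. A real system would replace the tabular policy with a finetuned VLM and the rule with LLM reflection over failed trajectories.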