Embodied agents have achieved remarkable performance in following human
instructions to complete tasks. However, the potential of providing
instructions informed by both text and images to assist humans in completing tasks
remains underexplored.
remains underexplored. To uncover this capability, we present the mu
Vision-and-Language Navigation with Multi-modal Prompts (VLN-MP) integrates natural language and images in instructions, showing improved navigation performance through the use of multi-modal and visual prompts.