Vision-and-Language Navigation (VLN) is a natural language grounding task in which an agent must interpret natural language instructions in the context of visual scenes in a dynamic environment to achieve prescribed navigation goals.
Vision-and-Language Navigation with Multi-modal Prompts (VLN-MP) extends this setting by integrating both natural language and images into the instructions; the use of such multi-modal visual prompts has been shown to improve navigation performance.