May, 2024

Correctable Landmark Discovery via Large Models for Vision-Language Navigation

TL;DR: Vision-Language Navigation (VLN) requires an agent to align the landmarks mentioned in an instruction with its visual observations. This paper proposes CONSOLE, a new paradigm that treats VLN as an open-world landmark discovery problem. It uses two large models, ChatGPT and CLIP, for accurate alignment and observation enhancement, and achieves state-of-the-art results on multiple VLN benchmarks.