May 2024
Correctable Landmark Discovery via Large Models for Vision-Language Navigation
Bingqian Lin, Yunshuang Nie, Ziming Wei, Yi Zhu, Hang Xu...
TL;DR: Vision-Language Navigation (VLN) requires an agent to align landmarks between the instruction and its visual observations. This paper proposes CONSOLE, a new paradigm that treats VLN as an open-world landmark discovery problem. It uses two large models, ChatGPT and CLIP, for accurate alignment and observation enhancement, and achieves state-of-the-art results on multiple VLN benchmarks.