Household environments are visually diverse. Embodied agents performing
Vision-and-Language Navigation (VLN) in the wild must be able to handle this
diversity while also following arbitrary language instructions. Recently,
vision-language models such as CLIP have shown strong performance on
zero-shot object recognition. In this work, we ask whether these models are also
capable of zero-shot language grounding. In particular, we use CLIP to
tackle the novel problem of zero-shot VLN using natural language referring
expressions that describe target objects, in contrast to past work that used
simple language templates describing object classes. We examine CLIP's
ability to make sequential navigational decisions without any
dataset-specific finetuning, and study how it influences the path that an agent
takes. Our results on the coarse-grained instruction-following task of REVERIE
demonstrate the navigational capability of CLIP, surpassing the supervised
baseline in terms of both success rate (SR) and success weighted by path length
(SPL). More importantly, we show quantitatively that our CLIP-based zero-shot
approach generalizes better than state-of-the-art fully supervised approaches,
performing more consistently across environments as measured by Relative
Change in Success (RCS).
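
For readers unfamiliar with the two headline metrics, the following is a
minimal sketch of how SR and SPL are commonly computed over a set of
evaluation episodes. It assumes the standard definition of SPL (success
weighted by the ratio of shortest-path length to the length actually
travelled); the Episode record and its field names are hypothetical and
are not taken from the REVERIE evaluation code.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Episode:
    """One evaluation episode (hypothetical record, for illustration only)."""
    success: bool          # whether the episode counted as a success
    shortest_path: float   # shortest start-to-goal path length
    agent_path: float      # length of the path the agent actually took


def success_rate(episodes: Sequence[Episode]) -> float:
    """Fraction of episodes that ended in success."""
    return sum(1 for e in episodes if e.success) / len(episodes)


def spl(episodes: Sequence[Episode]) -> float:
    """Success weighted by path length, averaged over all episodes."""
    total = 0.0
    for e in episodes:
        if not e.success:
            continue  # failed episodes contribute zero
        denom = max(e.agent_path, e.shortest_path)
        # Guard the degenerate case where the agent starts at the goal.
        total += 1.0 if denom == 0 else e.shortest_path / denom
    return total / len(episodes)


episodes = [
    Episode(success=True, shortest_path=10.0, agent_path=10.0),
    Episode(success=True, shortest_path=10.0, agent_path=20.0),
    Episode(success=False, shortest_path=10.0, agent_path=5.0),
    Episode(success=False, shortest_path=10.0, agent_path=30.0),
]
sr_value = success_rate(episodes)
spl_value = spl(episodes)
```

SPL can never exceed SR, because each successful episode contributes at
most 1 to the sum; a large gap between the two indicates that the agent
succeeds but takes long detours.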

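This work investigates using CLIP, in a zero-shot setting, to solve
vision-and-language navigation from natural language referring expressions
that describe target objects, and compares CLIP against supervised models on
the REVERIE dataset. The results show that the zero-shot CLIP approach
navigates better than the supervised baseline and generalizes better as
measured by Relative Change in Success (RCS).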
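
As a rough illustration of zero-shot grounding with a model like CLIP, the
sketch below scores a set of candidate views against a referring expression
by cosine similarity and picks the best match. This is only an assumed,
simplified decision rule, not the method evaluated in this work: the
embeddings are taken as precomputed, non-zero vectors, and the function
names cosine and pick_view are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pick_view(
    text_emb: Sequence[float],
    view_embs: Sequence[Sequence[float]],
) -> Tuple[int, List[float]]:
    """Return the index of the view most similar to the text, plus all scores.

    Assumption: text_emb and view_embs come from the text and image encoders
    of the same vision-language model, so they share one embedding space.
    """
    scores = [cosine(text_emb, v) for v in view_embs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy 2-D embeddings standing in for real encoder outputs.
text_emb = [1.0, 0.0]
view_embs = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.1]]
best_view, view_scores = pick_view(text_emb, view_embs)
```

Repeating such a choice at every step is one simple way to turn per-image
similarity scores into a sequence of navigational decisions, with no
dataset-specific finetuning involved.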