Vision-and-Language Navigation (VLN) stands as a key research problem of
Embodied AI, aiming at enabling agents to navigate in unseen environments
following linguistic instructions. In this field, generalization is a
long-standing challenge, either to out-of-distribution scenes or from Sim to
Real. In this paper, we propose NaVid, a video-based large vision language
model (VLM), to mitigate such a generalization gap. NaVid makes the first
endeavour to showcase the capability of VLMs to achieve state-of-the-art level
navigation performance without any maps, odometer and depth inputs. Following
human instruction, NaVid only requires an on-the-fly video stream from a
monocular RGB camera equipped on the robot to output the next-step action. Our
formulation mimics how humans navigate and naturally gets rid of the problems
introduced by odometer noises, and the Sim2Real gaps from map or depth inputs.
Moreover, our video-based approach can effectively encode the historical
observations of robots as spatio-temporal contexts for decision-making and
instruction following. We train NaVid with 550k navigation samples collected
from VLN-CE trajectories, including action-planning and instruction-reasoning
samples, along with 665k large-scale web data. Extensive experiments show that
NaVid achieves SOTA performance in simulation environments and the real world,
demonstrating superior cross-dataset and Sim2Real transfer. We thus believe our
proposed VLM approach plans the next step for not only the navigation agents
but also this research field.

NaVid 是一个基于视频的大型视觉语言模型，通过动态的视频流输入，无需地图、测距仪和深度信息，实现了最先进水平的导航性能，解决了里程计噪声和模拟环境到真实环境之间的缺陷，同时有效地利用机器人的历史观察作为决策和指令遵循的时空背景，通过对 550k 个导航样本和 665k 个网络数据的训练，在模拟环境和真实世界中取得了非常好的性能，为导航代理和整个研究领域规划了下一步。