Navigating in unseen environments is crucial for mobile robots. Enhancing
them with the ability to follow instructions in natural language will further
improve navigation efficiency in unseen cases. However, state-of-the-art (SOTA)
vision-and-language navigation (VLN) methods are mainly evaluated in
simulation, neglecting the complex and noisy real world. Directly transferring
SOTA navigation policies trained in simulation to the real world is challenging
due to the visual domain gap and the absence of prior knowledge about unseen
environments. In this work, we propose a novel navigation framework to address
the VLN task in the real world. Leveraging powerful foundation models, the
proposed framework includes four key components: (1) an LLM-based instruction
parser that converts the language instruction into a sequence of pre-defined
macro-action descriptions, (2) an online visual-language mapper that builds a
real-time visual-language map to maintain a spatial and semantic understanding
of the unseen environment, (3) a language indexing-based localizer that grounds
each macro-action description into a waypoint location on the map, and (4) a
DD-PPO-based local controller that predicts the action. We evaluate the
proposed pipeline on an Interbotix LoCoBot WX250 in an unseen lab environment.
Without any fine-tuning, our pipeline significantly outperforms the SOTA VLN
baseline in the real world.
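
As an illustration of component (3), the sketch below grounds an object named in a macro-action description to a waypoint by picking the map cell whose stored feature is most similar to the query embedding. It is a minimal sketch under stated assumptions, not the authors' implementation: the map layout (`toy_map`), the 3-dimensional vectors, the `toy_text_embedding` lookup, and the `localize` and `cosine` helpers are all hypothetical stand-ins. A real system would obtain both map features and text embeddings from a visual-language model.

```python
import math
from typing import Dict, List, Tuple

Cell = Tuple[int, int]  # hypothetical (x, y) grid coordinates on the map


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def localize(vl_map: Dict[Cell, List[float]], query: List[float]) -> Cell:
    """Return the map cell whose feature best matches the query embedding."""
    if not vl_map:
        raise ValueError("the map is empty")
    return max(vl_map, key=lambda cell: cosine(vl_map[cell], query))


# Toy map: each cell stores a made-up feature vector (assumption).
toy_map = {
    (0, 0): [1.0, 0.0, 0.0],  # region that looks like a chair
    (3, 1): [0.0, 1.0, 0.0],  # region that looks like a door
    (5, 4): [0.1, 0.2, 0.9],  # region that looks like a table
}

# Made-up text embeddings standing in for a real text encoder (assumption).
toy_text_embedding = {
    "door": [0.0, 0.9, 0.1],
    "table": [0.0, 0.1, 1.0],
}

# Hypothetical parser output: a sequence of (macro-action, object) pairs.
macro_actions = [("go to", "door"), ("go to", "table")]

waypoints = [localize(toy_map, toy_text_embedding[obj]) for _, obj in macro_actions]
print(waypoints)  # [(3, 1), (5, 4)]
```

Each resulting waypoint would then be handed to the local controller in sequence; that step is omitted here because the abstract does not describe the controller's interface.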

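In this paper, we propose a novel navigation framework for solving the VLN task in the real world. The framework leverages powerful foundation models and includes four key components: (1) an LLM-based instruction parser that converts the language instruction into pre-defined macro-action descriptions, (2) an online visual-language mapper that builds a real-time visual-language map to maintain a spatial and semantic understanding of the unseen environment, (3) a language indexing-based localizer that grounds each macro-action description to a waypoint location on the map, and (4) a DD-PPO-based local controller that predicts the action. We evaluate the proposed pipeline on an Interbotix LoCoBot WX250 in an unseen lab environment. Without any fine-tuning, our pipeline significantly outperforms the SOTA VLN baseline in the real world.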

Vision and Language Navigation in the Real World via Online Visual Language Mapping

We study the task of zero-shot vision-and-language navigation (ZS-VLN), a
practical yet challenging problem in which an agent learns to navigate
following a path described by language instructions without requiring any
path-instruction annotation data. Normally, the instructions have complex
grammatical structures and often contain various action descriptions (e.g.,
"proceed beyond", "depart from"). How to correctly understand and execute these
action demands is a critical problem, and the absence of annotated data makes
it even more challenging. Note that a well-educated human being can easily
understand path instructions without the need for any special training. In this
paper, we propose an action-aware zero-shot VLN method ($A^2$Nav) by exploiting
the vision-and-language ability of foundation models. Specifically, the
proposed method consists of an instruction parser and an action-aware
navigation policy. The instruction parser utilizes the advanced reasoning
ability of large language models (e.g., GPT-3) to decompose complex navigation
instructions into a sequence of action-specific object navigation sub-tasks.
Each sub-task requires the agent to localize the object and navigate to a
specific goal position according to the associated action demand. To accomplish
these sub-tasks, an action-aware navigation policy is learned from freely
collected action-specific datasets that reveal distinct characteristics of each
action demand. We use the learned navigation policy for executing sub-tasks
sequentially to follow the navigation instruction. Extensive experiments show
that $A^2$Nav achieves promising ZS-VLN performance and even surpasses
supervised learning methods on the R2R-Habitat and RxR-Habitat datasets.
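
To make the decomposition step concrete, the sketch below turns an instruction into an ordered list of action-specific sub-tasks and executes them one by one. It is only an illustration under stated assumptions: the paper's parser uses a large language model, whereas `toy_parse` here is a hypothetical regex stand-in, and `SubTask`, `ACTION_PHRASES`, `run`, and the example instruction are invented for this example. The phrase "go to" is likewise an assumed action, added alongside the two action descriptions quoted in the abstract.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

# Two phrases come from the abstract; "go to" is an added assumption.
ACTION_PHRASES = ["proceed beyond", "depart from", "go to"]


@dataclass(frozen=True)
class SubTask:
    """One action-specific object navigation sub-task (hypothetical structure)."""
    action: str
    landmark: str


def toy_parse(instruction: str) -> List[SubTask]:
    """Regex stand-in for the LLM parser: finds '<action> the <landmark>' spans."""
    alternatives = "|".join(re.escape(p) for p in ACTION_PHRASES)
    pattern = re.compile(rf"({alternatives}) the (\w+)", re.IGNORECASE)
    return [
        SubTask(m.group(1).lower(), m.group(2).lower())
        for m in pattern.finditer(instruction)
    ]


def run(subtasks: List[SubTask], policy: Callable[[SubTask], str]) -> List[str]:
    """Execute sub-tasks in order; the policy is a placeholder callable."""
    return [policy(task) for task in subtasks]


instruction = "Depart from the sofa, proceed beyond the table and go to the door."
subtasks = toy_parse(instruction)

# Placeholder policy that only records what it was asked to do.
log = run(subtasks, lambda t: f"{t.action} -> {t.landmark}")
print(log)
```

In the method described above, the placeholder policy would be replaced by the learned action-aware navigation policy, which conditions on the action demand of each sub-task.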

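We propose an action-aware zero-shot vision-and-language navigation (ZS-VLN) method ($A^2$Nav) that exploits the vision-and-language ability of foundation models. The method decomposes complex navigation instructions into a sequence of action-specific object navigation sub-tasks, then learns an action-aware navigation policy from collected action-specific datasets that exhibit distinct characteristics, and executes the sub-tasks sequentially to carry out the full navigation instruction. Experiments show that $A^2$Nav achieves strong ZS-VLN performance and even surpasses supervised learning methods on the R2R-Habitat and RxR-Habitat datasets.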