Incremental decision making in real-world environments is one of the most
challenging tasks in embodied artificial intelligence. One particularly
demanding scenario is Vision and Language Navigation (VLN), which requires
visual and natural language understanding as well as spatial and temporal
reasoning capabilities. The embodied agent needs to ground its understanding of
navigation instructions in observations of a real-world environment like Street
View. Despite the impressive results of LLMs in other research areas, how best
to connect them with an interactive visual environment remains an open
problem. In this work, we propose VELMA, an embodied LLM agent that uses a
verbalization of the trajectory and of visual environment observations as a
contextual prompt for the next action. Visual information is verbalized by a
pipeline that extracts landmarks from the human-written navigation instructions
and uses CLIP to determine their visibility in the current panorama view. We
show that VELMA is able to successfully follow navigation instructions in
Street View with only two in-context examples. We further finetune the LLM
agent on a few thousand examples and achieve a 25%-30% relative improvement in
task completion over the previous state-of-the-art for two datasets.
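
The verbalization step described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the direction phrases, the 0.5 threshold, the sentence template, and the `score` callable (a stand-in for an image-text similarity model such as CLIP) are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical mapping from panorama view directions to phrases; the source
# does not say how the panorama is split into views.
DIRECTION_PHRASES: Dict[str, str] = {
    "left": "on your left",
    "ahead": "ahead",
    "right": "on your right",
}


def verbalize_landmarks(
    landmarks: List[str],
    score: Callable[[str, str], float],
    threshold: float = 0.5,
) -> str:
    """Return a text observation listing the landmarks judged visible.

    `score(landmark, direction)` stands in for an image-text similarity
    model; the threshold value is an assumption, not taken from the source.
    """
    sentences = []
    for landmark in landmarks:
        # Pick the view direction in which the landmark scores highest.
        best = max(DIRECTION_PHRASES, key=lambda d: score(landmark, d))
        # Only mention the landmark if the best score clears the threshold.
        if score(landmark, best) >= threshold:
            sentences.append(f"There is {landmark} {DIRECTION_PHRASES[best]}.")
    return " ".join(sentences)


# Toy similarity scores in place of a real vision-language model.
toy_scores: Dict[Tuple[str, str], float] = {
    ("a bank", "left"): 0.8,
    ("a red church", "ahead"): 0.3,
}
observation = verbalize_landmarks(
    ["a bank", "a red church"],
    lambda landmark, direction: toy_scores.get((landmark, direction), 0.0),
)
print(observation)
```

In a setup like this, the resulting sentence would be appended to the text prompt alongside the instructions and the trajectory so far, so the LLM only ever sees text.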

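This study proposes VELMA, an embodied language model agent for vision and language navigation. It extracts landmark information from human-written navigation instructions and uses CLIP to process image information, which allows it to interact with a real-world Street View environment. On two datasets, VELMA improves the task completion rate by a relative 25%-30% over prior work.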

VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View

This paper presents a new approach for synthesizing a novel street-view
panorama given an overhead satellite image. Taking a small satellite image
patch as input, our method generates an omnidirectional panorama in the style
of Google Street View, as if it were captured from the same geographical
location as the center of the satellite patch. Existing works treat this task
as an image generation problem and adopt generative adversarial networks to
implicitly learn the cross-view transformations, while ignoring the domain
relevance. In
this paper, we propose to explicitly establish the geometric correspondences
between the two-view images so as to facilitate the cross-view transformation
learning. Specifically, we observe that when a 3D point in the real world is
visible in both views, there is a deterministic mapping between the projected
points in the two-view images given the height information of this 3D point.
Motivated by this, we develop a novel Satellite to Street-view image Projection
(S2SP) module which explicitly establishes such geometric correspondences and
projects the satellite images to the street viewpoint. With these projected
satellite images as network input, we next employ a generator to synthesize
realistic street-view panoramas that are geometrically consistent with the
satellite images. Our S2SP module is differentiable and the whole framework is
trained in an end-to-end manner. Extensive experimental results on two
cross-view benchmark datasets demonstrate that our method generates images that
better respect the scene geometry than existing approaches.
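
The deterministic mapping mentioned above can be sketched for a single point. This is an illustrative sketch under stated assumptions, not the paper's S2SP module: it assumes an orthographic, north-up, square satellite patch with a known ground resolution, a camera located above the patch center, and an equirectangular panorama whose column 0 faces north. The function name, parameter names, and the default camera height are hypothetical.

```python
import math
from typing import Tuple


def satellite_to_panorama(
    col: float,
    row: float,
    height_m: float,
    sat_size: int,
    meters_per_pixel: float,
    pano_w: int,
    pano_h: int,
    camera_height_m: float = 2.0,  # assumed camera height above ground
) -> Tuple[float, float]:
    """Map a satellite pixel with a known height to panorama coordinates."""
    # Offset of the point from the patch center, in meters.
    center = (sat_size - 1) / 2.0
    east = (col - center) * meters_per_pixel
    north = (center - row) * meters_per_pixel  # image rows grow southward
    ground_dist = math.hypot(east, north)

    # Viewing angles as seen from the camera at the patch center.
    azimuth = math.atan2(east, north) % (2 * math.pi)  # clockwise from north
    elevation = math.atan2(height_m - camera_height_m, ground_dist)

    # Equirectangular projection: azimuth -> column, elevation -> row.
    u = azimuth / (2 * math.pi) * pano_w
    v = (0.5 - elevation / math.pi) * pano_h
    return u, v


# A point 2 m due east of the camera, at camera height, lands on the horizon
# a quarter of the way around the panorama.
u, v = satellite_to_panorama(138, 128, 2.0, 257, 0.2, 512, 256)
# The same ground location at a greater height appears higher in the image.
u_tall, v_tall = satellite_to_panorama(138, 128, 10.0, 257, 0.2, 512, 256)
```

The sketch shows why height matters: the panorama column depends only on the horizontal offset, while the row changes with the height of the 3D point, so the mapping is fixed once the height is known.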

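This paper proposes a new method that generates novel street-view panoramas by establishing geometric correspondences between street-view panoramas and satellite images, and demonstrates its advantage in respecting scene geometry.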