Recent years have witnessed an emerging paradigm shift toward embodied
artificial intelligence, in which an agent must learn to solve challenging
tasks by interacting with its environment. There are several challenges in
solving embodied multimodal tasks, including long-horizon planning,
vision-and-language grounding, and efficient exploration. We focus on a
critical bottleneck, namely the performance of planning and navigation. To
tackle this challenge, we propose a Neural SLAM approach that, for the first
time, utilizes several modalities for exploration, predicts an affordance-aware
semantic map, and plans over it at the same time. This significantly improves
exploration efficiency, leads to robust long-horizon planning, and enables
effective vision-and-language grounding. With the proposed Affordance-aware
Multimodal Neural SLAM (AMSLAM) approach, we obtain more than 40% improvement
over prior published work on the ALFRED benchmark and set a new
state-of-the-art generalization performance at a success rate of 23.48% on the
test unseen scenes.

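Summary: The paper proposes a Neural SLAM approach that uses multiple modalities for exploration, predicts an affordance-aware semantic map, and plans over it. This significantly improves exploration efficiency, yields robust long-horizon planning, and enables more effective vision-and-language grounding. On the ALFRED benchmark, the proposed Affordance-aware Multimodal Neural SLAM (AMSLAM) approach improves on prior published work by more than 40% and sets a new state of the art with a success rate of 23.48%.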

Learning to Act with Affordance-Aware Multimodal Neural SLAM

In vision-and-language grounding problems, fine-grained representations of
the image are considered to be of paramount importance. Most of the current
systems incorporate visual features and textual concepts as a sketch of an
image. However, plainly inferred representations are usually undesirable in
that they are composed of separate components, the relations of which are
elusive. In this work, we aim at representing an image with a set of integrated
visual regions and corresponding textual concepts, reflecting certain
semantics. To this end, we build the Mutual Iterative Attention (MIA) module,
which integrates correlated visual features and textual concepts, respectively,
by aligning the two modalities. We evaluate the proposed approach on two
representative vision-and-language grounding tasks, i.e., image captioning and
visual question answering. In both tasks, the semantic-grounded image
representations consistently boost the performance of the baseline models under
all metrics. The results demonstrate that our approach is
effective and generalizes well to a wide range of models for image-related
applications. (The code is available at this https URL)
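
The alignment idea described above, in which visual features and textual concepts are each integrated by attending to the other modality, can be illustrated with a minimal sketch. This is not the authors' MIA implementation. It assumes that both modalities are already projected to the same dimension, that alignment is plain scaled dot-product attention, and that each round adds the attended vectors back as a residual. The function names (`cross_attend`, `mutual_iterative_attention`) and the `num_iters` parameter are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, memory):
    """For each query vector, return an attention-weighted average of memory vectors."""
    dim = len(queries[0])
    attended = []
    for q in queries:
        # Scaled dot-product similarity between this query and every memory vector.
        scores = [sum(a * b for a, b in zip(q, m)) / math.sqrt(dim) for m in memory]
        weights = softmax(scores)
        attended.append(
            [sum(w * m[k] for w, m in zip(weights, memory)) for k in range(dim)]
        )
    return attended


def add_rows(xs, ys):
    """Element-wise sum of two equally shaped lists of vectors."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]


def mutual_iterative_attention(visual, textual, num_iters=2):
    """Assumed scheme: both modalities attend to each other for num_iters rounds."""
    v, t = visual, textual
    for _ in range(num_iters):
        # Both updates read the previous round's features, so the step is symmetric.
        v_next = add_rows(v, cross_attend(v, t))
        t_next = add_rows(t, cross_attend(t, v))
        v, t = v_next, t_next
    return v, t


# Toy input: three region features and two concept embeddings, all 2-dimensional.
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
textual = [[1.0, 0.0], [0.0, 1.0]]
v_out, t_out = mutual_iterative_attention(visual, textual)
print(len(v_out), len(t_out))
```

The number of regions and concepts is unchanged by the procedure; only the content of each vector is updated with information from the other modality. A captioning or question-answering baseline could then consume the updated vectors in place of the original features.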

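Summary: The paper aims to represent an image with a set of integrated visual regions and corresponding textual concepts that reflect certain semantics. To this end, the authors build the Mutual Iterative Attention (MIA) module and validate the approach on tasks such as image captioning and visual question answering. The results show that the approach generalizes well across image-related applications and improves the performance of the baseline models.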