Embodied AI is one of the most active research areas in artificial intelligence
and robotics, and it can effectively improve the intelligence of real-world
agents (i.e., robots) that serve human beings. Scene knowledge is important for
an agent to understand its surroundings and make correct decisions in the varied
open world. Currently, a knowledge base for embodied tasks is missing, and most
existing work uses general knowledge bases or pre-trained models to enhance the
intelligence of an agent. Conventional knowledge bases are sparse, insufficient
in capacity, and costly in data collection. Pre-trained models suffer from
uncertain knowledge and are hard to maintain. To overcome the
challenges of scene knowledge, we propose a scene-driven multimodal knowledge
graph (Scene-MMKG) construction method combining conventional knowledge
engineering and large language models. A unified scene knowledge injection
framework is introduced for knowledge representation. To evaluate the
advantages of our proposed method, we instantiate Scene-MMKG for typical indoor
robotic functionalities (Manipulation and Mobility), and name the instance
ManipMob-MMKG. A comparison of characteristics indicates that our instantiated
ManipMob-MMKG has broad superiority in data-collection efficiency and knowledge
quality. Experimental results on typical embodied tasks show that
knowledge-enhanced methods using our instantiated ManipMob-MMKG can clearly
improve performance without complex redesign of model structures. Our
project can be found at this https URL
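
As an illustration only, a multimodal knowledge graph of the kind described above can be pictured as a set of (head, relation, tail) triples whose entities may carry image references. The sketch below is a minimal toy under that assumption; the class names (`Entity`, `SceneKG`), the relation names (`located_in`, `affords`), and the image file name are hypothetical and are not taken from Scene-MMKG or ManipMob-MMKG.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    """A node with a text label and optional image references (hypothetical schema)."""
    name: str
    images: tuple = ()  # e.g. file paths of pictures that depict the entity


@dataclass
class SceneKG:
    """Toy multimodal knowledge graph storing (head, relation, tail) triples."""
    entities: dict = field(default_factory=dict)
    triples: set = field(default_factory=set)

    def add_entity(self, name, images=()):
        self.entities[name] = Entity(name, tuple(images))

    def add_triple(self, head, relation, tail):
        # Register unseen endpoints as text-only entities.
        for name in (head, tail):
            if name not in self.entities:
                self.add_entity(name)
        self.triples.add((head, relation, tail))

    def neighbors(self, head, relation=None):
        """Return tails linked from `head`, optionally filtered by relation."""
        return sorted(
            t for h, r, t in self.triples
            if h == head and (relation is None or r == relation)
        )


# Hypothetical indoor-scene facts relevant to manipulation and mobility.
kg = SceneKG()
kg.add_entity("mug", images=["mug_01.jpg"])
kg.add_triple("mug", "located_in", "kitchen")
kg.add_triple("mug", "affords", "grasp")
kg.add_triple("mug", "affords", "pour")
```

Here `kg.neighbors("mug", "affords")` lists the actions attached to the mug, which is the kind of lookup a knowledge-enhanced embodied model could use as extra input.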

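By combining conventional knowledge engineering and large language models, we propose a scene-driven multimodal knowledge graph construction method for knowledge representation and for enhancing indoor robotic functionalities. We evaluate the advantages of our method by instantiating ManipMob-MMKG, which shows broad superiority in data-collection efficiency and knowledge quality. Experimental results show that knowledge-enhanced methods using our instantiated ManipMob-MMKG can clearly improve performance without complex redesign of model structures.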

Scene-Driven Multimodal Knowledge Graph Construction for Embodied AI

Visual grounding (VG) aims to establish fine-grained alignment between vision
and language. Ideally, it can be a testbed for vision-and-language models to
evaluate their understanding of the images and texts and their reasoning
abilities over their joint space. However, most existing VG datasets are
constructed using simple description texts, which do not require sufficient
reasoning over the images and texts. This has been demonstrated in a recent
study~\cite{luo2022goes}, where a simple LSTM-based text encoder without
pretraining can achieve state-of-the-art performance on mainstream VG datasets.
Therefore, in this paper, we propose a novel benchmark of Scene
Knowledge-guided Visual Grounding (SK-VG),
where the image content and referring expressions alone are not sufficient to
ground the target objects, forcing models to reason over the long-form scene
knowledge. To perform this task, we propose two approaches that accept the
triple-type input: the former embeds knowledge into the image features before
the image-query interaction, while the latter leverages linguistic structure to
assist in computing the image-text matching. We conduct extensive experiments
to analyze these methods and show that the proposed approaches achieve
promising results but still leave room for improvement in both performance and
interpretability. The dataset and code are available at
https://github.com/zhjohnchan/SK-VG.
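
To make the idea of embedding knowledge into image features before the image-query interaction more concrete, here is a minimal sketch under strong simplifying assumptions. It is not the authors' model: the hand-written 3-d feature vectors, the dot-product attention, and the function names (`embed_knowledge`, `ground`) are all hypothetical stand-ins chosen for illustration.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def embed_knowledge(regions, knowledge):
    """Add to each region an attention-weighted sum of knowledge token vectors."""
    fused = []
    for region in regions:
        weights = softmax([dot(region, k) for k in knowledge])
        context = [sum(w * k[i] for w, k in zip(weights, knowledge))
                   for i in range(len(region))]
        fused.append([r + c for r, c in zip(region, context)])
    return fused


def ground(regions, knowledge, query):
    """Return the index of the region that best matches the query after fusion."""
    fused = embed_knowledge(regions, knowledge)
    scores = [dot(f, query) for f in fused]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy feature axes (assumed): [is a person, wears red, is named Jack].
regions = [[1.0, 0.0, 0.0],    # region 0: a person, not in red
           [1.0, 1.0, 0.0]]    # region 1: a person in red
knowledge = [[0.0, 2.0, 2.0],  # "Jack wears red" links the red and Jack axes
             [0.0, 0.0, 0.0]]  # a null token for regions with no matching fact
query = [0.0, 0.0, 1.0]        # referring expression: "Jack"

target = ground(regions, knowledge, query)
```

In this toy setup the query alone scores both regions equally, since neither visual feature encodes the name. After the knowledge is fused in, region 1 scores higher and `target` is 1, which mirrors the SK-VG premise that the image and the referring expression are insufficient without scene knowledge.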

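This paper proposes a new benchmark dataset, SK-VG, in which the image content and referring expressions are not sufficient to identify the target objects, forcing models to reason over long-form scene knowledge. We propose two approaches that accept the triple-type input: the former embeds knowledge into the image features before the image-query interaction, and the latter leverages linguistic structure to assist in computing the image-text matching. Extensive experiments demonstrate the feasibility of the proposed approaches and show that they achieve promising results, while still leaving room for improvement in performance and interpretability.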