Semantic mapping built on supervised object detectors is sensitive to the image distribution. In real-world environments, object detection and segmentation performance can drop sharply, which prevents semantic mapping from being used in a wider range of domains. Meanwhile, vision-language foundation models demonstrate strong zero-shot transferability across data distributions, which provides an opportunity to construct generalizable instance-aware semantic maps. Hence, this work explores how to boost instance-aware semantic mapping using object detections generated by foundation models. We propose a probabilistic label fusion method to predict closed-set semantic classes from open-set label measurements. An instance refinement module merges the over-segmented instances caused by inconsistent segmentation. We integrate all the modules into a unified semantic mapping system. Reading a sequence of RGB-D input, our system incrementally reconstructs an instance-aware semantic map. We evaluate the zero-shot performance of our method on the ScanNet and SceneNN datasets. Our method achieves 40.3 mean average precision (mAP) on the ScanNet semantic instance segmentation task, significantly outperforming the traditional semantic mapping method.
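
To make the idea of predicting a closed-set class from open-set label measurements concrete, the following is a minimal sketch of a generic recursive Bayesian update, not the formulation used in this work. It assumes that measurements are conditionally independent given the true class and that a likelihood table p(open-set label | closed-set class) is available. The class names, label names, likelihood values, and the function `fuse_labels` are all hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical closed-set classes and open-set labels (illustration only).
CLASSES = ["chair", "table", "sofa"]
OPEN_LABELS = ["chair", "armchair", "couch", "desk"]

# LIKELIHOOD[i, j] = p(open-set label j is observed | true class is i).
# The values are made up; in practice such a table would have to be estimated.
LIKELIHOOD = np.array([
    [0.60, 0.25, 0.10, 0.05],  # chair
    [0.05, 0.05, 0.05, 0.85],  # table
    [0.10, 0.20, 0.65, 0.05],  # sofa
])


def fuse_labels(measurements, prior=None):
    """Return the posterior over CLASSES after fusing open-set label measurements.

    Assumes the measurements are conditionally independent given the class.
    """
    n = len(CLASSES)
    if prior is None:
        prior = np.full(n, 1.0 / n)  # uniform prior over closed-set classes
    log_post = np.log(np.asarray(prior, dtype=float))
    for label in measurements:
        j = OPEN_LABELS.index(label)
        # Bayes update in log space: add the log-likelihood of this measurement.
        log_post += np.log(LIKELIHOOD[:, j])
    # Normalize in a numerically stable way.
    log_post -= log_post.max()
    post = np.exp(log_post)
    return post / post.sum()


# Labels observed for one instance across several frames (hypothetical).
posterior = fuse_labels(["armchair", "couch", "chair", "chair"])
predicted = CLASSES[int(np.argmax(posterior))]
print(predicted, posterior)
```

In this sketch, a single inconsistent measurement such as "couch" lowers the confidence in "chair" without overriding the evidence accumulated from the other frames, which is the motivation for fusing labels probabilistically rather than taking the latest detection at face value.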

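Building on vision-language foundation models, this work proposes a probabilistic label fusion method that predicts closed-set semantic classes from open-set label measurements in order to boost instance-aware semantic mapping. The modules are integrated into a unified semantic mapping system, and the method's zero-shot performance is evaluated on the ScanNet and SceneNN datasets, achieving 40.3 mean average precision (mAP) and significantly outperforming traditional methods.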

FM-Fusion: Instance-Aware Semantic Mapping Boosted by Vision-Language Foundation Models