In this work, we propose a novel approach to densely ground visual entities from a long caption. We leverage a large multimodal model (LMM) to extract semantic nouns, a class-agnostic segmentation model to generate entity-level segmentation, and the proposed multi-modal feature fusion module to associate each semantic noun with its corresponding segmentation mask. Additionally, we introduce a strategy of encoding entity segmentation masks into a colormap, enabling the preservation of fine-grained predictions from features of high-resolution masks. This approach allows us to extract visual features from low-resolution images using the CLIP vision encoder in the LMM, which is more computationally efficient than existing approaches that use an additional encoder for high-resolution images. Our comprehensive experiments demonstrate the superiority of our method, outperforming state-of-the-art techniques on three tasks, including panoptic narrative grounding, referring expression segmentation, and panoptic segmentation.

我们提出了一种新的方法来从长描述中密集地连接视觉实体，利用大型多模态模型提取语义名词，利用无类别分割模型生成实体级分割，采用多模态特征融合模块将每个语义名词与其对应的分割蒙版关联。此方法利用颜色映射对实体分割蒙版进行编码，使得细粒度预测能够保留高分辨率蒙版的特征。该方法使用LMM中的CLIP视觉编码器从低分辨率图像中提取视觉特征，比使用额外编码器处理高分辨率图像的现有方法在计算上更高效。我们的全面实验表明，我们的方法卓越于三个任务，包括全景叙事连接、指称表达分割和全景分割。

基于大语言模型的通用实体链接