In this paper, we introduce an open-vocabulary panoptic segmentation model
that effectively unifies the strengths of the Segment Anything Model (SAM) with
the vision-language CLIP model in an end-to-end framework. While SAM excels in
generating spatially-aware masks, its decoder falls short in recognizing
object class information and tends to oversegment without additional guidance.
Existing approaches address this limitation by using multi-stage techniques and
employing separate models to generate class-aware prompts, such as bounding
boxes or segmentation masks. Our proposed method, PosSAM, is an end-to-end model
that leverages SAM's spatially rich features to produce instance-aware masks
and harnesses CLIP's semantically discriminative features for effective
instance classification. Specifically, we address the limitations of SAM and
propose a novel Local Discriminative Pooling (LDP) module leveraging
class-agnostic SAM and class-aware CLIP features for unbiased open-vocabulary
classification. Furthermore, we introduce a Mask-Aware Selective Ensembling
(MASE) algorithm that adaptively enhances the quality of generated masks and
boosts the performance of open-vocabulary classification during inference for
each image. We conducted extensive experiments to demonstrate our method's
strong generalization properties across multiple datasets, achieving
state-of-the-art performance with substantial improvements over prior
open-vocabulary panoptic segmentation methods. In the COCO-to-ADE20K and
ADE20K-to-COCO settings, PosSAM outperforms the previous state-of-the-art
methods by 2.4 PQ and 4.6 PQ, respectively. Project Website:
this https URL

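We propose an open-vocabulary panoptic segmentation model that combines the
strengths of the Segment Anything Model (SAM) and the vision-language CLIP
model in an end-to-end framework. A Local Discriminative Pooling (LDP) module
overcomes SAM's limitations, and a Mask-Aware Selective Ensembling (MASE)
algorithm adaptively improves the quality of the generated masks. The model
shows strong generalization across multiple datasets and a marked improvement
over existing open-vocabulary panoptic segmentation methods.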
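
As a rough illustration of how class-agnostic masks and class-aware features can be combined for open-vocabulary classification, the sketch below average-pools a per-pixel feature map inside each mask and matches the pooled vector to text embeddings by cosine similarity. This is generic mask pooling, not the LDP module itself, whose details are not given here. The function names `mask_pool` and `classify`, the array shapes, and the toy inputs are all assumptions made for illustration.

```python
import numpy as np


def mask_pool(features, masks):
    """Average per-pixel features inside each binary mask.

    features: (H, W, D) feature map (hypothetical stand-in for CLIP features).
    masks:    (N, H, W) binary masks (hypothetical stand-in for SAM masks).
    Returns an (N, D) array with one pooled feature vector per mask.
    """
    masks = masks.astype(features.dtype)
    # Pixel count per mask; floor at 1 so an empty mask does not divide by zero.
    area = np.maximum(masks.sum(axis=(1, 2)), 1.0)
    pooled = np.einsum("nhw,hwd->nd", masks, features)
    return pooled / area[:, None]


def classify(pooled, text_embeddings):
    """Return, for each mask, the index of the most similar text embedding."""
    p = pooled / np.maximum(np.linalg.norm(pooled, axis=1, keepdims=True), 1e-8)
    t = text_embeddings / np.maximum(
        np.linalg.norm(text_embeddings, axis=1, keepdims=True), 1e-8
    )
    return (p @ t.T).argmax(axis=1)  # cosine similarity, best class per mask


# Toy 2x2 image: the left column carries feature [1, 0], the right column [0, 1].
features = np.zeros((2, 2, 2))
features[:, 0, 0] = 1.0
features[:, 1, 1] = 1.0

masks = np.zeros((2, 2, 2), dtype=bool)
masks[0, :, 0] = True  # mask 0 covers the left column
masks[1, :, 1] = True  # mask 1 covers the right column

text_embeddings = np.array([[1.0, 0.0], [0.0, 1.0]])  # one row per class name
print(classify(mask_pool(features, masks), text_embeddings).tolist())  # [0, 1]
```

Because the class list enters only through `text_embeddings`, the vocabulary can be changed at inference time without retraining, which is the property that makes this kind of classifier open-vocabulary.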

PosSAM: Panoptic Open-vocabulary Segment Anything

Recently, methods have been proposed for 3D open-vocabulary semantic
segmentation. Such methods are able to segment scenes into arbitrary classes
given at run-time using their text descriptions. In this paper, we propose, to
our knowledge, the first algorithm for open-vocabulary panoptic segmentation,
simultaneously performing both semantic and instance segmentation. Our
algorithm, Panoptic Vision-Language Feature Fields (PVLFF), learns a feature
field of the scene, jointly learning vision-language features and hierarchical
instance features through a contrastive loss function from 2D instance segment
proposals on input frames. Our method achieves performance comparable to
state-of-the-art closed-set 3D panoptic systems on the HyperSim, ScanNet, and
Replica datasets and outperforms current 3D open-vocabulary systems in terms of
semantic segmentation. We additionally ablate our method to demonstrate the
effectiveness of our model architecture. Our code will be available at
this https URL

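We propose a new algorithm, Panoptic Vision-Language Feature Fields (PVLFF),
that performs semantic and instance segmentation simultaneously. It jointly
learns vision-language features and hierarchical instance features by applying
a contrastive loss to 2D instance segment proposals on the input frames. It
achieves performance comparable to state-of-the-art closed-set 3D panoptic
systems on the HyperSim, ScanNet, and Replica datasets and outperforms current
3D open-vocabulary systems on semantic segmentation.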
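
To make the idea of learning instance features from segment proposals through a contrastive loss more concrete, the sketch below implements a generic supervised contrastive (InfoNCE-style) loss. Features sampled from the same segment are pulled together and features from different segments are pushed apart. It is not PVLFF's actual loss, which is not specified here and is described as hierarchical. The function name `segment_contrastive_loss`, the temperature value, and the toy data are assumptions made for illustration.

```python
import numpy as np


def segment_contrastive_loss(features, segment_ids, temperature=0.1):
    """Generic supervised contrastive loss over sampled pixel features.

    features:    (N, D) feature vectors sampled from one frame.
    segment_ids: (N,) integer id of the 2D segment proposal each sample lies in.
    Assumes N >= 2 and that at least one segment contributes two or more samples.
    """
    f = features / np.maximum(np.linalg.norm(features, axis=1, keepdims=True), 1e-8)
    sim = f @ f.T / temperature

    not_self = ~np.eye(len(f), dtype=bool)
    positives = (segment_ids[:, None] == segment_ids[None, :]) & not_self

    # Log-softmax over all *other* samples, computed stably.
    sim = np.where(not_self, sim, -np.inf)
    row_max = sim.max(axis=1, keepdims=True)
    log_norm = row_max + np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))
    log_prob = sim - log_norm

    # Average negative log-probability of the positives, per anchor that has any.
    has_pos = positives.any(axis=1)
    pos_log_prob = np.where(positives, log_prob, 0.0).sum(axis=1)
    per_anchor = -pos_log_prob[has_pos] / positives.sum(axis=1)[has_pos]
    return float(per_anchor.mean())


# Toy data: samples 0 and 1 are similar, as are samples 2 and 3.
feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
consistent = segment_contrastive_loss(feats, np.array([0, 0, 1, 1]))
inconsistent = segment_contrastive_loss(feats, np.array([0, 1, 0, 1]))
print(consistent < inconsistent)  # True
```

The loss is low when features already cluster by segment and high when the segment labels cut across similar features, so minimizing it drives the learned features toward the grouping given by the proposals.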