In this paper, we introduce an open-vocabulary panoptic segmentation model
that effectively unifies the strengths of the Segment Anything Model (SAM) with
the vision-language CLIP model in an end-to-end framework. While SAM excels in
generating spatially-aware masks, its decoder falls short in recognizing
object class information and tends to oversegment without additional guidance.
Existing approaches address this limitation by using multi-stage techniques and
employing separate models to generate class-aware prompts, such as bounding
boxes or segmentation masks. Our proposed method, PosSAM, is an end-to-end model
that leverages SAM's spatially rich features to produce instance-aware masks
and harnesses CLIP's semantically discriminative features for effective
instance classification. Specifically, we address the limitations of SAM and
propose a novel Local Discriminative Pooling (LDP) module leveraging
class-agnostic SAM and class-aware CLIP features for unbiased open-vocabulary
classification. Furthermore, we introduce a Mask-Aware Selective Ensembling
(MASE) algorithm that adaptively enhances the quality of generated masks and
boosts the performance of open-vocabulary classification during inference for
each image. Extensive experiments demonstrate our method's strong
generalization across multiple datasets, with substantial improvements over
prior state-of-the-art
open-vocabulary panoptic segmentation methods. In both COCO to ADE20K and
ADE20K to COCO settings, PosSAM outperforms the previous state-of-the-art
methods by a large margin, 2.4 PQ and 4.6 PQ, respectively.

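This paper proposes an open-vocabulary panoptic segmentation model that combines the strengths of the Segment Anything Model (SAM) and the vision-language CLIP model in an end-to-end framework. A Local Discriminative Pooling (LDP) module overcomes the limitations of SAM, and a Mask-Aware Selective Ensembling (MASE) algorithm adaptively improves the quality of the generated masks. The model shows strong generalization across multiple datasets and achieves significant improvements over existing open-vocabulary panoptic segmentation methods.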