This study develops and evaluates Text-Visual-Prompt SAM (TV-SAM), a novel zero-shot segmentation algorithm for multimodal medical images that requires no manual annotations. TV-SAM integrates the large language model GPT-4, the vision-language model GLIP, and the Segment Anything Model (SAM) to autonomously generate descriptive text prompts and visual bounding-box prompts from medical images, thereby enhancing SAM's zero-shot segmentation. Comprehensive evaluations on seven public datasets encompassing eight imaging modalities demonstrate that TV-SAM can effectively segment unseen targets across modalities without additional training. It significantly outperforms SAM AUTO and GSAM, closely matches SAM BBOX with gold-standard bounding-box prompts, and surpasses the state of the art on specific datasets such as ISIC and WBC. The study indicates that TV-SAM is an effective zero-shot segmentation algorithm for multimodal medical images and highlights the significant contribution of GPT-4 to zero-shot segmentation. Integrating foundation models such as GPT-4, GLIP, and SAM could enhance the capability to address complex problems in specialized domains. The code is available at: https://github.com/JZK00/TV-SAM.
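
The three-stage prompting flow described above (text prompt, then bounding-box prompt, then mask) can be illustrated with a minimal sketch. This is not the authors' implementation, which lives in the linked repository. The names `PromptPipeline`, `stub_describe`, `stub_ground`, and `stub_segment` are hypothetical, and the stubs are simple threshold rules standing in for GPT-4, GLIP, and SAM so that the example runs with the Python standard library alone.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Image = List[List[float]]        # grayscale image as rows of intensities
Mask = List[List[bool]]          # same shape as the image
Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), end-exclusive


@dataclass
class PromptPipeline:
    """Hypothetical wrapper that chains three pluggable stages."""
    describe: Callable[[Image], str]           # stage 1: image -> text prompt (LLM role)
    ground: Callable[[Image, str], List[Box]]  # stage 2: image + text -> boxes (VLM role)
    segment: Callable[[Image, Box], Mask]      # stage 3: image + box -> mask (SAM role)

    def run(self, image: Image) -> Tuple[str, List[Box], Mask]:
        text = self.describe(image)
        boxes = self.ground(image, text)
        height, width = len(image), len(image[0])
        merged = [[False] * width for _ in range(height)]
        # Take the union of the per-box masks.
        for box in boxes:
            part = self.segment(image, box)
            for y in range(height):
                for x in range(width):
                    merged[y][x] = merged[y][x] or part[y][x]
        return text, boxes, merged


def _mean(image: Image) -> float:
    return sum(map(sum, image)) / (len(image) * len(image[0]))


def stub_describe(image: Image) -> str:
    # Assumption: a fixed description stands in for a language model's output.
    return "bright compact region on a dark background"


def stub_ground(image: Image, text: str) -> List[Box]:
    # Assumption: ignores the text and boxes all above-mean pixels.
    threshold = _mean(image)
    hits = [(x, y) for y, row in enumerate(image)
            for x, value in enumerate(row) if value > threshold]
    if not hits:
        return []
    xs = [x for x, _ in hits]
    ys = [y for _, y in hits]
    return [(min(xs), min(ys), max(xs) + 1, max(ys) + 1)]


def stub_segment(image: Image, box: Box) -> Mask:
    # Assumption: thresholding inside the box stands in for a segmentation model.
    x0, y0, x1, y1 = box
    threshold = _mean(image)
    return [[x0 <= x < x1 and y0 <= y < y1 and value > threshold
             for x, value in enumerate(row)]
            for y, row in enumerate(image)]


# Toy 8x8 image with a bright 3x3 square.
image = [[0.0] * 8 for _ in range(8)]
for y in range(2, 5):
    for x in range(3, 6):
        image[y][x] = 1.0

pipeline = PromptPipeline(stub_describe, stub_ground, stub_segment)
text, boxes, mask = pipeline.run(image)
```

Keeping each stage behind a plain callable means a real model could be swapped in for any stub without changing `run`. How the actual system passes prompts between its models is not specified in this abstract.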

Improving SAM's zero-shot performance on multimodal medical images using descriptive prompts generated by GPT-4