Segment Anything Model (SAM) has recently shown powerful and versatile capabilities on (un)conditional image segmentation tasks. Although SAM supports various segmentation prompts, we note that it performs much worse on text-instructed tasks than on point- and box-guided segmentation. We argue that deep text instruction tuning is key to mitigating this shortcoming, which is caused by the shallow fusion scheme in its default lightweight mask decoder. In this paper, we propose two \emph{deep instruction tuning} (DIT) methods, one end-to-end and the other layer-wise. With these tuning methods, the image encoder of SAM can be regarded as a stand-alone vision-language learner, in contrast to building another deep fusion branch. Extensive experiments on three highly competitive benchmark datasets for referring image segmentation show that a simple end-to-end DIT improves SAM by a large margin, while layer-wise DIT further boosts the performance to state-of-the-art. Our code is anonymously released at: https://github.com/wysnzzzz/DIT.

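For (un)conditional image segmentation with the Segment Anything Model (SAM), we find that SAM performs worse on text-guided tasks than on point- and box-guided segmentation, owing to the shallow fusion scheme in its default lightweight mask decoder. This paper proposes two deep instruction tuning (DIT) methods, one end-to-end and the other layer-wise. With these tuning methods, the image encoder of SAM can be regarded as a stand-alone vision-language learner rather than building another deep fusion branch. Extensive experiments on three highly competitive benchmark datasets for referring image segmentation show that a simple end-to-end DIT significantly improves SAM's performance, while layer-wise DIT further pushes it to the state of the art.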
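
The abstract does not spell out the mechanics of the two DIT variants, so the following is only a toy sketch of one plausible reading: in the end-to-end variant, text tokens are projected once and travel through every encoder layer together with the image tokens; in the layer-wise variant, the text tokens are re-projected and re-injected before each layer. All function names here are hypothetical, the scalar `scale` stands in for a learned projection, and the identity-projection attention stands in for a real ViT block. None of this is SAM's actual API or the authors' implementation.

```python
import math

def attention(tokens):
    """Toy single-head self-attention with identity projections and a residual."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        mixed = [sum(w / z * v[i] for w, v in zip(weights, tokens)) for i in range(d)]
        out.append([x + y for x, y in zip(q, mixed)])  # residual connection
    return out

def project(text_tokens, scale):
    """Hypothetical projection: scalar scaling in place of a learned linear layer."""
    return [[scale * x for x in t] for t in text_tokens]

def end_to_end_dit(image_tokens, text_tokens, num_layers):
    """Assumed end-to-end variant: project text once, then run the joint sequence."""
    seq = project(text_tokens, 1.0) + image_tokens
    for _ in range(num_layers):
        seq = attention(seq)
    return seq[len(text_tokens):]  # keep only the image positions

def layer_wise_dit(image_tokens, text_tokens, layer_scales):
    """Assumed layer-wise variant: inject freshly projected text before each layer."""
    vis = image_tokens
    for scale in layer_scales:
        seq = attention(project(text_tokens, scale) + vis)
        vis = seq[len(text_tokens):]  # drop text outputs, keep image features
    return vis

if __name__ == "__main__":
    img = [[1.0, 0.0], [0.0, 1.0]]   # two toy image tokens
    txt = [[1.0, 0.0]]               # one toy text token, same width
    print(end_to_end_dit(img, txt, num_layers=2))
    print(layer_wise_dit(img, txt, layer_scales=[1.0, 0.5]))
```

In both functions the text tokens influence the image features inside the encoder itself, which is the sense in which the encoder acts as a vision-language learner without a separate fusion branch; the two variants differ only in whether the text is injected once or at every layer.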

Deep Instruction Tuning for Segment Anything Model