Recently, the emergence of large-scale vision-language models (VLMs), such as CLIP, has opened the way towards open-world object perception. Many works have explored the use of pre-trained VLMs for the challenging open-vocabulary dense prediction task, which requires perceiving diverse objects from novel classes at inference time. Existing methods build their experiments on public datasets from related tasks, which are not tailored to the open-vocabulary setting and rarely involve imperceptible objects camouflaged in complex scenes, owing to data collection bias and annotation costs. To fill this gap, we introduce a new task, open-vocabulary camouflaged object segmentation (OVCOS), and construct a large-scale complex-scene dataset (\textbf{OVCamo}) containing 11,483 hand-selected images with fine annotations and corresponding object classes. Further, we build a strong single-stage open-vocabulary \underline{c}amouflaged \underline{o}bject \underline{s}egmentation transform\underline{er} baseline, \textbf{OVCoser}, attached to the parameter-fixed CLIP with iterative semantic guidance and structure enhancement. By integrating the guidance of class semantic knowledge with supplementary visual structure cues from edge and depth information, the proposed method can efficiently capture camouflaged objects. Moreover, this effective framework also surpasses previous state-of-the-art methods for open-vocabulary semantic image segmentation by a large margin on our OVCamo dataset. With the proposed dataset and baseline, we hope that this new task, which has more practical value, can further expand research on open-vocabulary dense prediction tasks.

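Recently, the emergence of large-scale vision-language models (VLMs), such as CLIP, has opened the way toward open-world object perception. We propose a new task, open-vocabulary camouflaged object segmentation (OVCOS), and construct a large-scale complex-scene dataset (OVCamo) containing 11,483 hand-selected images with fine-grained annotations. By integrating the guidance of class semantic knowledge with supplementary visual structure cues from edge and depth information, the proposed method can effectively capture camouflaged objects. Moreover, this effective framework also surpasses previous state-of-the-art open-vocabulary semantic image segmentation methods on our OVCamo dataset. With the proposed dataset and baseline, we hope that this new task, which has more practical value, can further expand research on open-vocabulary dense prediction tasks.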

Open-vocabulary camouflaged object segmentation