Colonoscopic polyp re-identification aims to match a specific polyp against a large gallery of images captured with different cameras and from different views, and it plays a key role in computer-aided diagnosis for the prevention and treatment of colorectal cancer. However, traditional methods mainly focus on visual representation learning while neglecting the potential of semantic features during training, which can lead to poor generalization when a pretrained model is adapted to new scenarios. To relieve this dilemma, we propose a simple but effective training method named VT-ReID, which enriches the representation of polyp videos through the interchange of high-level semantic information. Moreover, we design a novel clustering mechanism that introduces prior knowledge from textual data and leverages contrastive learning to promote better separation on abundant unlabeled text data. To the best of our knowledge, this is the first attempt to employ visual-text features with a clustering mechanism for colonoscopic polyp re-identification. Empirical results show that our method outperforms current state-of-the-art methods by a clear margin.
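The abstract does not state which contrastive objective VT-ReID uses, so the following is only an illustrative sketch of one common way to align visual and text features: a symmetric InfoNCE-style loss over paired embeddings. The function names, the temperature of 0.07, and the toy vectors are assumptions made for illustration, not details taken from the paper.

```python
import math

def l2_normalize(v):
    # Scale a non-zero vector to unit length so dot products are cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_logits(visual, text, temperature):
    # Pairwise cosine similarity between every visual and text embedding,
    # divided by a temperature (hypothetical default chosen by the caller).
    v = [l2_normalize(x) for x in visual]
    t = [l2_normalize(x) for x in text]
    return [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t] for vi in v]

def cross_entropy_diagonal(logits):
    # Mean cross-entropy where the correct class for row i is column i,
    # i.e. the i-th visual feature should match the i-th text feature.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(visual, text, temperature=0.07):
    # Average of the visual-to-text and text-to-visual losses.
    logits = cosine_logits(visual, text, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))

# Toy embeddings (made up): two polyps, each with one visual and one text feature.
visual = [[1.0, 0.0], [0.0, 1.0]]
aligned_text = [[0.9, 0.1], [0.1, 0.9]]   # text i describes polyp i
swapped_text = [[0.1, 0.9], [0.9, 0.1]]   # text i describes the other polyp

loss_aligned = symmetric_contrastive_loss(visual, aligned_text)
loss_swapped = symmetric_contrastive_loss(visual, swapped_text)
print(loss_aligned < loss_swapped)
```

In this sketch the loss is small when each visual feature is closest to its own text feature and large when the pairs are swapped, which is the behaviour a visual-text alignment objective is meant to reward.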


Learning Discriminative Visual-Text Representation for Polyp Re-Identification