Image captioning, a fundamental task in vision-language understanding, seeks to generate accurate natural language descriptions for provided images. The CLIP model, with its rich semantic features learned from a large corpus of image-text pairs, is well-suited for this task. In this paper, we present a two-stage semi-supervised image captioning approach that exploits the potential of CLIP encoding. Our model comprises a CLIP visual encoder, a mapping network, and a language model for text generation. In the initial stage, we train the model using a small labeled dataset by contrasting the generated captions with the ground truth captions. In the subsequent stage, we continue the training using unlabeled images, aiming to maximize the image-caption similarity based on CLIP embeddings. Remarkably, despite utilizing less than 2% of the COCO-captions, our approach delivers a performance comparable to state-of-the-art models trained on the complete dataset. Furthermore, the captions generated by our approach are more distinctive, informative, and in line with human preference.

本文提出了一种利用CLIP模型进行半监督图像标注的方法，包括图像编码器、映射网络和语言模型，通过对比生成的标题和实际标题，并使用未标记的图像进行二次训练，得到了与完整数据集训练的业界最先进模型相比可比的性能，且标题更加独特、信息量更大，并且符合人类的偏好。

使用CLIP的半监督图像字幕生成