Contrastive Language-Image Pre-training (CLIP) has significantly improved performance in various vision-language tasks by expanding the dataset with image-text pairs obtained from websites. This paper further explores CLIP from the perspectives of data and model architecture. To address the prevalence of noisy data and enhance the quality of large-scale image-text data crawled from the internet, we introduce a diverse description generation framework that can leverage Large Language Models (LLMs) to synthesize and refine content from web-based texts, synthetic captions, and detection tags. Furthermore, we propose RWKV-CLIP, the first RWKV-driven vision-language representation learning model that combines the effective parallel training of transformers with the efficient inference of RNNs. Comprehensive experiments across various model scales and pre-training datasets demonstrate that RWKV-CLIP is a robust and efficient vision-language representation learner, it achieves state-of-the-art performance in several downstream tasks, including linear probe, zero-shot classification, and zero-shot image-text retrieval. To facilitate future research, the code and pre-trained models are released at https://github.com/deepglint/RWKV-CLIP

通过扩展数据集和模型架构，该研究进一步探索了具有对比语言-图像预训练（CLIP）的视觉语言任务的性能，在处理来自网站的图像-文本对时。通过引入多样化的描述生成框架，该研究提出了RWKV-CLIP，其中结合了变压器的有效并行训练和循环神经网络的高效推理。通过广泛的实验和多种模型规模和预训练数据集，证明了RWKV-CLIP是一个强大而有效的视觉语言表征学习器，在线性探测、零样例分类和零样例图像-文本检索等多个下游任务中实现了最先进的性能。

RWKV-CLIP：一个稳健的视觉-语言表示学习器