We present SignCLIP, which re-purposes CLIP (Contrastive Language-Image Pretraining) to project spoken language text and sign language videos, two classes of natural languages of distinct modalities, into the same space. SignCLIP is an efficient method of learning useful visual representations for sign language processing from large-scale, multilingual video-text pairs, without directly optimizing for a specific task or a single sign language, for which the available data is often of limited size. We pretrain SignCLIP on Spreadthesign, a prominent sign language dictionary consisting of about 500 thousand video clips in up to 44 sign languages, and evaluate it on various downstream datasets. SignCLIP discerns in-domain signing with notable text-to-video/video-to-text retrieval accuracy. It also performs competitively on out-of-domain downstream tasks such as isolated sign language recognition, given essential few-shot prompting or fine-tuning. We analyze the latent space formed by the spoken language text and sign language poses, which provides additional linguistic insights. Our code and models are openly available.
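
To make the idea of a shared text-video space concrete, the following is a minimal sketch of a CLIP-style symmetric contrastive objective and of nearest-neighbor retrieval on toy embeddings. It is not the SignCLIP implementation: the function names, the toy vectors, and the fixed temperature of 0.07 are illustrative assumptions, and real encoders would produce the embeddings from text and from sign language video or pose input.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_matrix(text_embs, video_embs):
    # Entry [i][j] is the cosine similarity between text i and video j.
    t = [l2_normalize(v) for v in text_embs]
    s = [l2_normalize(v) for v in video_embs]
    return [[sum(a * b for a, b in zip(ti, sj)) for sj in s] for ti in t]

def cross_entropy_diagonal(logits):
    # Mean cross-entropy where the correct class for row i is column i.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def clip_style_loss(text_embs, video_embs, temperature=0.07):
    # Hypothetical helper: symmetric contrastive loss over a batch of
    # matched (text, video) pairs; temperature is an assumed constant.
    sim = similarity_matrix(text_embs, video_embs)
    logits = [[x / temperature for x in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]  # video-to-text direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

def retrieve(text_embs, video_embs):
    # Text-to-video retrieval: index of the most similar video for each text.
    sim = similarity_matrix(text_embs, video_embs)
    return [max(range(len(row)), key=row.__getitem__) for row in sim]

# Toy embeddings (made up for illustration): pair i of text and video match.
toy_text = [[1.0, 0.0], [0.0, 1.0]]
toy_video = [[0.9, 0.1], [0.2, 0.8]]

matches = retrieve(toy_text, toy_video)
aligned_loss = clip_style_loss(toy_text, toy_video)
shuffled_loss = clip_style_loss(toy_text, toy_video[::-1])
```

In this toy setup the loss is lower when each text is paired with its matching video than when the videos are swapped, which is the signal that pulls matching pairs together in the shared space during pretraining. Video-to-text retrieval follows by calling `retrieve` with the arguments reversed.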

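SignCLIP re-purposes CLIP to project spoken language text and sign language videos into the same space, learning visual representations useful for sign language processing from large-scale, multilingual video-text pairs. It is pretrained on Spreadthesign and evaluated on various downstream datasets, where it achieves notable text-to-video/video-to-text retrieval accuracy and performs competitively on out-of-domain tasks such as isolated sign language recognition. The study analyzes the latent space formed by spoken language text and sign language poses, which yields additional linguistic insights.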

SignCLIP: Connecting Text and Sign Language by Contrastive Learning