Models that leverage both visual and textual data, such as Contrastive Language-Image Pre-training (CLIP), are rapidly gaining importance. In this work, we show that despite their versatility, such models are vulnerable to what we refer to as fooling master images. Fooling master images maximize the confidence score of a CLIP model for a significant number of widely varying prompts while being unrecognizable to humans. We demonstrate how fooling master images can be mined by searching the latent space of generative models using an evolution strategy or stochastic gradient descent. We investigate the properties of the mined fooling master images and find that images trained on a small number of image captions potentially generalize to a much larger number of semantically related captions. Further, we evaluate two possible mitigation strategies and find that vulnerability to fooling master images is closely related to a modality gap in contrastively pre-trained multi-modal networks. From the perspective of vulnerability to off-manifold attacks, we therefore argue for the mitigation of modality gaps in CLIP and related multi-modal approaches. Source code and mined CLIPMasterPrints are available at https://github.com/matfrei/CLIPMasterPrints.
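The mining procedure described above can be illustrated with a minimal sketch. It is not the implementation from the repository. It assumes a toy setting in which a random linear map followed by `tanh` stands in for the generative decoder plus CLIP image encoder, random unit vectors stand in for the CLIP text embeddings of the target prompts, and a simple elite-averaging evolution strategy stands in for whichever strategy is actually used. The objective (the minimum cosine similarity over all prompts) and all names and hyperparameters (`make_toy_problem`, `min_similarity`, `mine_latent`) are likewise assumptions made for illustration.

```python
import numpy as np


def normalize(x):
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def make_toy_problem(rng, latent_dim=16, embed_dim=32, n_prompts=5):
    """Build hypothetical stand-ins for the real models.

    W plays the role of "decoder + image encoder"; text_emb plays the
    role of the text embeddings of the target prompts.
    """
    W = rng.standard_normal((latent_dim, embed_dim))
    text_emb = normalize(rng.standard_normal((n_prompts, embed_dim)))
    return W, text_emb


def min_similarity(z, W, text_emb):
    """Lowest cosine similarity between the image embedding of z and any prompt.

    Maximizing this value pushes the candidate towards all prompts at once.
    Accepts a single latent (latent_dim,) or a batch (n, latent_dim).
    """
    img_emb = normalize(np.tanh(z @ W))
    return (img_emb @ text_emb.T).min(axis=-1)


def mine_latent(W, text_emb, rng, generations=200, pop_size=32,
                n_elite=8, sigma=0.5):
    """Search the latent space with a simple elite-averaging evolution strategy."""
    mean = np.zeros(W.shape[0])
    best_z, best_score = mean.copy(), -np.inf
    for _ in range(generations):
        # Sample candidates around the current search mean.
        pop = mean + sigma * rng.standard_normal((pop_size, mean.size))
        scores = min_similarity(pop, W, text_emb)
        # Move the mean to the average of the best candidates.
        elite = pop[np.argsort(scores)[-n_elite:]]
        mean = elite.mean(axis=0)
        # Track the best candidate seen so far.
        i = int(scores.argmax())
        if scores[i] > best_score:
            best_z, best_score = pop[i].copy(), float(scores[i])
        sigma *= 0.99  # slowly narrow the search
    return best_z, best_score


rng = np.random.default_rng(0)
W, text_emb = make_toy_problem(rng)
best_z, best_score = mine_latent(W, text_emb, rng)
baseline = float(min_similarity(rng.standard_normal((32, W.shape[0])), W, text_emb).mean())
print(f"random latents: {baseline:.3f}, mined latent: {best_score:.3f}")
```

In a real setting, the stand-ins would be replaced by an actual generative decoder and the CLIP image and text encoders, and the gradient-based variant would optimize the same objective with stochastic gradient descent instead of sampling.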

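By mining the latent space of generative models with an evolution strategy or stochastic gradient descent, we show that fooling master images can be found that maximize the confidence score of a CLIP model for a large number of differing prompts while remaining unrecognizable to humans. We investigate the properties of the mined fooling master images and find that images trained on a small number of image captions may generalize to a much larger number of semantically related captions. Furthermore, we evaluate two possible mitigation strategies and find that vulnerability to fooling master examples is closely related to the modality gap in contrastively pre-trained multi-modal networks. We therefore propose reducing the modality gap in CLIP and related multi-modal approaches to mitigate vulnerability to off-manifold attacks.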

CLIPMasterPrints: Fooling Contrastive Language-Image Pre-training Using Latent Variable Evolution