Vision-language pre-training (VLP) models demonstrate impressive abilities in processing both images and text. However, they are vulnerable to multi-modal adversarial examples (AEs). Investigating how to generate highly transferable adversarial examples is crucial for uncovering the vulnerabilities of VLP models in practical scenarios. Recent works have indicated that leveraging data augmentation and image-text modal interactions can significantly enhance the transferability of adversarial examples for VLP models. However, they do not consider the optimal alignment problem between data-augmented image-text pairs. This oversight leads to adversarial examples that are overly tailored to the source model, which limits improvements in transferability. In our research, we first explore the interplay between image sets produced through data augmentation and their corresponding text sets. We find that augmented image samples can align optimally with certain texts while exhibiting less relevance to others. Motivated by this, we propose an Optimal Transport-based Adversarial Attack, dubbed OT-Attack. The proposed method formulates the features of image and text sets as two distinct distributions and employs optimal transport theory to determine the most efficient mapping between them. This optimal mapping guides the generation of adversarial examples and thereby counteracts the overfitting issue. Extensive experiments across various network architectures and datasets on image-text matching tasks show that OT-Attack outperforms existing state-of-the-art methods in terms of adversarial transferability.
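
To make the idea of an optimal mapping between an image feature set and a text feature set concrete, the following is a minimal sketch, not the authors' implementation. It assumes a cosine-distance cost, uniform marginals, and an entropy-regularized Sinkhorn solver, none of which are specified in the abstract. The function names (`cosine_cost`, `sinkhorn_plan`, `ot_distance`) and the random features standing in for encoder outputs are hypothetical.

```python
import numpy as np


def cosine_cost(img_feats, txt_feats):
    """Pairwise cost 1 - cosine similarity (an assumed choice of cost)."""
    img = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    txt = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    return 1.0 - img @ txt.T


def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropy-regularized transport plan between uniform marginals."""
    m, n = cost.shape
    a = np.full(m, 1.0 / m)  # mass on each augmented image (assumed uniform)
    b = np.full(n, 1.0 / n)  # mass on each text (assumed uniform)
    K = np.exp(-cost / eps)  # Gibbs kernel
    u = np.ones(m)
    v = np.ones(n)
    for _ in range(n_iters):
        u = a / (K @ v)    # rescale rows toward marginal a
        v = b / (K.T @ u)  # rescale columns to match marginal b
    return u[:, None] * K * v[None, :]


def ot_distance(img_feats, txt_feats, eps=0.1, n_iters=200):
    """Transport cost <plan, cost> and the plan itself."""
    cost = cosine_cost(img_feats, txt_feats)
    plan = sinkhorn_plan(cost, eps, n_iters)
    return float(np.sum(plan * cost)), plan


# Random features stand in for the outputs of image and text encoders.
rng = np.random.default_rng(0)
img_feats = rng.normal(size=(5, 16))  # 5 augmented views of one image
txt_feats = rng.normal(size=(3, 16))  # 3 paired captions
dist, plan = ot_distance(img_feats, txt_feats)
```

Each entry `plan[i, j]` indicates how much of augmented image `i` is matched to text `j`, so well-aligned pairs receive more mass than weakly related ones. Under these assumptions, an attack could perturb the image to increase `dist` instead of a uniformly averaged pairwise loss; how OT-Attack actually defines and optimizes its objective is not stated in the abstract.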

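Vision-language pre-training (VLP) models show impressive abilities in processing images and text, yet they are vulnerable to multi-modal adversarial examples. By examining the optimal alignment problem that arises between data augmentation and image-text modal interactions, this work proposes an adversarial attack based on optimal transport theory, named OT-Attack, to counteract overfitting. Extensive experiments across various network architectures and datasets on image-text matching tasks show that OT-Attack outperforms existing state-of-the-art methods in adversarial transferability.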

OT-Attack: Enhancing Adversarial Transferability of Vision-Language Models via Optimal Transport Optimization