There has been an increasing interest in the alignment of large language
models (LLMs) with human values. However, the safety issues of their
integration with a vision module, or vision language models (VLMs), remain
relatively underexplored. In this paper, we propose a novel jailbreaking attack
against VLMs, aiming to bypass their safety barrier when a user inputs harmful
instructions. A scenario where our poisoned (image, text) data pairs are
included in the training data is assumed. By replacing the original textual
captions with malicious jailbreak prompts, our method can perform jailbreak
attacks with the poisoned images. Moreover, we analyze the effect of poison
ratios and positions of trainable parameters on our attack's success rate. For
evaluation, we design two metrics to quantify the success rate and the
stealthiness of our attack. Together with a list of curated harmful
instructions, a benchmark for measuring attack efficacy is provided. We
demonstrate the efficacy of our attack by comparing it with baseline methods.

本文提出了一种针对视觉语言模型的新型越狱攻击方法，通过替换原始文本标题为恶意越狱提示，来攻击包含恶意图像的视觉语言模型。通过分析毒素比例和可训练参数位置对攻击成功率的影响，我们设计了两个指标来量化攻击的成功率和隐秘性，提供了一个用于测量攻击效果的基准。通过与基准方法进行比较，我们证明了我们的攻击方法的有效性。