Large vision-language models (VLMs) like GPT-4V represent an unprecedented
revolution in the field of artificial intelligence (AI). Compared to
single-modal large language models (LLMs), VLMs possess more versatile
capabilities by incorporating additional modalities (e.g., images). Meanwhile,
there is rising enthusiasm in the AI community for developing open-source VLMs,
such as LLaVA and MiniGPT4, which, however, have not undergone rigorous safety
assessment. In this paper, to demonstrate that more modalities lead to
unforeseen AI safety issues, we propose FigStep, a novel jailbreaking framework
against VLMs. FigStep feeds harmful instructions into VLMs through the image
channel and then uses benign text prompts to induce VLMs to output content
that violate common AI safety policies. Our experimental results show that
FigStep can achieve an average attack success rate of 94.8% across two families
of popular open-source VLMs, LLaVA and MiniGPT4 (five VLMs in total). Moreover,
we demonstrate that the methodology of FigStep can even jailbreak GPT-4V, which
already leverages several system-level mechanisms to filter harmful queries.
Overall, our experimental results reveal that VLMs are vulnerable to
jailbreaking attacks, which highlights the necessity of novel safety alignment
between visual and textual modalities.

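By injecting harmful instructions through the image channel and then using
benign text prompts to induce vision-language models to output content that
violates common AI safety policies, this study demonstrates that
vision-language models are vulnerable to jailbreaking attacks, underscoring the
need for novel safety alignment between the visual and textual modalities.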