This paper focuses on jailbreaking attacks against multi-modal large language
models (MLLMs), which aim to induce MLLMs to generate objectionable responses to
harmful user queries. We propose a maximum likelihood-based algorithm to find
an \emph{image Jailbreaking Prompt} (imgJP) that enables jailbreaks against MLLMs
across multiple unseen prompts and images (i.e., a data-universal property). Our
approach exhibits strong model-transferability, as the generated imgJP can be
transferred to jailbreak various models, including MiniGPT-v2, LLaVA,
InstructBLIP, and mPLUG-Owl2, in a black-box manner. Moreover, we reveal a
connection between MLLM-jailbreaks and LLM-jailbreaks. Building on this connection, we
introduce a construction-based method that applies our approach to
LLM-jailbreaks and is more efficient than current state-of-the-art
methods. The code is available here. \textbf{Warning: some content generated by
language models may be offensive to some readers.}
