In this paper, we study the harmlessness alignment problem of multimodal
large language models~(MLLMs). We conduct a systematic empirical analysis of
the harmlessness performance of representative MLLMs and reveal that the image
input introduces an alignment vulnerability in MLLMs. Motivated by this finding, we propose
a novel jailbreak method named HADES, which uses meticulously crafted images
to hide and amplify the harmfulness of the malicious intent within the text
input. Experimental results show that HADES can effectively jailbreak existing
MLLMs, achieving an average Attack Success Rate~(ASR) of 90.26% on
LLaVA-1.5 and 71.60% on Gemini Pro Vision. Our code and data will be publicly
released.
