In this paper, we study the harmlessness alignment problem of multimodal
large language models~(MLLMs). We conduct a systematic empirical analysis of
the harmlessness performance of representative MLLMs and reveal that image
input is an alignment vulnerability of MLLMs. Inspired by this, we propose
a novel jailbreak method named HADES, which hides and amplifies the harmfulness
of the malicious intent within the text input using meticulously crafted
images. Experimental results show that HADES can effectively jailbreak existing
MLLMs, achieving an average Attack Success Rate~(ASR) of 90.26% for
LLaVA-1.5 and 71.60% for Gemini Pro Vision. Our code and data will be publicly
released.

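This paper studies the harmlessness alignment problem of multimodal large language models (MLLMs). Through a systematic empirical analysis of the harmlessness performance of representative MLLMs, it reveals that image input is a vulnerability in MLLM alignment. Building on this, it proposes a novel jailbreak method named HADES, which uses carefully crafted images to hide and amplify the harmfulness of the malicious intent in the text input. Experimental results show that HADES can effectively jailbreak existing MLLMs, with an average attack success rate (ASR) of 90.26% for LLaVA-1.5 and 71.60% for Gemini Pro Vision. The code and data will be publicly released.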

Images Are the Weak Point of Alignment: Exploiting Visual Vulnerabilities to Jailbreak Multimodal Large Language Models

Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models

Cyber-phishing attacks have recently become more precise, targeted, and
tailored through training data to activate only in the presence of specific
information or cues. They adapt far more readily than traditional phishing
detection does. Hence, automated detection systems cannot always be 100%
accurate, which increases uncertainty about expected behavior when faced with a potential
phishing email. On the other hand, human-centric defence approaches focus
extensively on user training but face the difficulty of keeping users up to
date with continuously emerging patterns. Therefore, advances in analyzing
email content in novel ways, along with summarizing the most pertinent
content for recipients, are a promising avenue for combating these threats.
Addressing this gap, this work leverages transformer-based machine learning
to (i) analyze prospective psychological triggers, (ii) detect possible
malicious intent, and (iii) create representative summaries of emails. We
then amalgamate this information and present it to users so that they can
(i) easily decide whether an email is "phishy" and (ii) self-learn advanced
malicious patterns.

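This paper explores the application of machine learning to analyzing email content, detecting potential malicious intent, and creating email summaries, in order to help users judge whether an email is safe and self-learn more advanced malicious patterns.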