Recent advances in Large Vision-Language Models (VLMs) have demonstrated
their strong performance on various multimodal tasks. However, the adversarial
robustness of VLMs has not been fully explored. Existing methods mainly assess
robustness through unimodal adversarial attacks that perturb images, while
assuming inherent resilience against text-based attacks. Unlike existing
attacks, we propose a more comprehensive strategy that jointly attacks the
text and image modalities to exploit a broader spectrum of vulnerabilities
within VLMs. Specifically, we propose a dual optimization
objective aimed at guiding the model to generate affirmative responses with
high toxicity. Our attack method begins by optimizing an adversarial image
prefix from random noise to generate diverse harmful responses in the absence
of text input, thus imbuing the image with toxic semantics. Subsequently, an
adversarial text suffix is integrated and co-optimized with the adversarial
image prefix to maximize the probability of eliciting affirmative responses to
various harmful instructions. The discovered adversarial image prefix and text
suffix are collectively denoted as a Universal Master Key (UMK). When
integrated into various malicious queries, UMK can circumvent the alignment
defenses of VLMs and lead to the generation of objectionable content, known as
jailbreaks. The experimental results demonstrate that our universal attack
strategy can effectively jailbreak MiniGPT-4 with a 96% success rate,
highlighting the vulnerability of VLMs and the urgent need for new alignment
strategies.
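
The two-stage procedure described above can be written down compactly. The following is only a sketch of one plausible formalization: the abstract gives no notation, so every symbol here is an assumption. $p_\theta$ stands for the VLM's output distribution, $x$ for the image prefix, $t$ for the text suffix, $\{y_i\}$ for the harmful responses used in the first stage, $\{(q_j, a_j)\}$ for harmful instructions paired with affirmative target responses, $\oplus$ for concatenation, and $\mathcal{X}$, $\mathcal{T}$ for the sets of admissible images and suffixes.

```latex
% Stage 1: image prefix optimized from random noise, with no text input
x^{(1)} = \arg\min_{x \in \mathcal{X}} \sum_{i=1}^{m} -\log p_\theta\left(y_i \mid x, \varnothing\right)

% Stage 2: image prefix and text suffix co-optimized, starting from x^{(1)}
\left(x_{\mathrm{adv}}, t_{\mathrm{adv}}\right) = \arg\min_{x \in \mathcal{X},\, t \in \mathcal{T}} \sum_{j=1}^{n} -\log p_\theta\left(a_j \mid x, q_j \oplus t\right)
```

Under this reading, the pair $(x_{\mathrm{adv}}, t_{\mathrm{adv}})$ is what the abstract calls the UMK. The exact losses, constraints, and optimizers used by the authors are not stated in the abstract.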

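This work proposes a comprehensive strategy for attacking large vision-language models that jointly targets the text and image modalities to exploit a broader range of vulnerabilities. Experiments show that the universal attack jailbreaks MiniGPT-4 with a 96% success rate, highlighting the vulnerability of vision-language models and the urgent need for new alignment strategies.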

White-box Multimodal Jailbreaks Against Large Vision-Language Models

The generations of large language models are commonly controlled through
prompting techniques, where a user's query to the model is prefixed with a
prompt that aims to guide the model's behaviour on the query. The prompts used
by companies to guide their models are often treated as secrets, to be hidden
from the user making the query. They have even been treated as commodities to
be bought and sold. However, there has been anecdotal evidence showing that the
prompts can be extracted by a user even when they are kept secret. In this
paper, we present a framework for systematically measuring the success of
prompt extraction attacks. In experiments with multiple sources of prompts and
multiple underlying language models, we find that simple text-based attacks can
in fact reveal prompts with high probability.
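
To make "measuring the success of prompt extraction" concrete, here is a minimal sketch of one possible scoring scheme. It is not taken from the abstract, which does not say how success is defined. The assumption here is that an extraction counts as successful when the guess recovers, in order, at least a threshold fraction of the secret prompt's whitespace-separated tokens. The function names `lcs_length`, `prompt_recall`, and `extraction_rate`, the 0.9 threshold, and the example strings are all hypothetical.

```python
from typing import List, Sequence, Tuple


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two token sequences."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def prompt_recall(secret: str, guess: str) -> float:
    """Fraction of the secret prompt's tokens that the guess recovers in order."""
    secret_tokens = secret.lower().split()
    guess_tokens = guess.lower().split()
    if not secret_tokens:
        return 0.0
    return lcs_length(secret_tokens, guess_tokens) / len(secret_tokens)


def extraction_rate(pairs: List[Tuple[str, str]], threshold: float = 0.9) -> float:
    """Share of (secret, guess) pairs whose recall meets the assumed threshold."""
    if not pairs:
        return 0.0
    hits = sum(prompt_recall(secret, guess) >= threshold for secret, guess in pairs)
    return hits / len(pairs)


# Hypothetical (secret prompt, extracted guess) pairs for illustration only.
pairs = [
    ("You are a helpful travel assistant.",
     "Sure! My instructions: You are a helpful travel assistant."),
    ("Answer only in French.", "I cannot share that."),
]
rate = extraction_rate(pairs)
```

A recall-style score is used so that extra text around a correctly leaked prompt does not count against the extraction. Whether that matches the paper's own criterion cannot be told from the abstract.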

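This paper presents a framework for measuring prompt extraction attacks on large language models and shows experimentally that text-based attacks can extract prompts with high probability.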