Large language models (LLMs), designed to provide helpful and safe responses,
often rely on alignment techniques to conform to user intent and social
guidelines. Unfortunately, this alignment can be exploited by malicious actors
seeking to manipulate an LLM's outputs for unintended purposes. In this paper
we introduce a novel approach that employs a genetic algorithm (GA) to
manipulate LLMs when model architecture and parameters are inaccessible. The GA
attack works by optimizing a universal adversarial prompt that -- when combined
with a user's query -- disrupts the attacked model's alignment, resulting in
unintended and potentially harmful outputs. Our approach systematically
reveals a model's limitations and vulnerabilities by uncovering instances where
its responses deviate from expected behavior. Through extensive experiments we
demonstrate the efficacy of our technique, thus contributing to the ongoing
discussion on responsible AI development by providing a diagnostic tool for
evaluating and enhancing alignment of LLMs with human intent. To our knowledge,
this is the first automated universal black-box jailbreak attack.

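Introduces a new method that uses a genetic algorithm to manipulate large language models whose architecture and parameters are inaccessible. By optimizing a universal adversarial prompt, the method exposes a model's limitations and vulnerabilities and disrupts its alignment, providing a diagnostic tool for evaluating and improving the alignment of large language models with human intent.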

Universal Black-Box Jailbreaking of Large Language Models

Open Sesame! Universal Black Box Jailbreaking of Large Language Models
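
The genetic-algorithm loop the abstract above describes (score candidates using only black-box outputs, keep the fittest, recombine, mutate) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the character vocabulary, the names `black_box_score` and `evolve`, the tournament selection, and all hyperparameters are hypothetical. The scoring function is a toy string-matching stand-in that queries no model; it only plays the role of a score computed without access to weights or gradients.

```python
import random

VOCAB = list("abcdefghijklmnopqrstuvwxyz ")
# Toy stand-in target. The optimizer never reads it directly; it only
# observes the scalar returned by black_box_score (an assumption here).
HIDDEN_TARGET = "open sesame"


def black_box_score(candidate: str) -> float:
    """Toy fitness in [0, 1]: fraction of positions matching the hidden target."""
    matches = sum(a == b for a, b in zip(candidate, HIDDEN_TARGET))
    return matches / len(HIDDEN_TARGET)


def random_individual(rng: random.Random, length: int) -> str:
    return "".join(rng.choice(VOCAB) for _ in range(length))


def crossover(rng: random.Random, a: str, b: str) -> str:
    """One-point crossover: prefix of one parent, suffix of the other."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def mutate(rng: random.Random, s: str, rate: float) -> str:
    """Resample each character from the vocabulary with probability `rate`."""
    return "".join(rng.choice(VOCAB) if rng.random() < rate else ch for ch in s)


def tournament(rng, population, scores, k=3):
    """Return the fittest of k randomly drawn individuals."""
    picks = rng.sample(range(len(population)), k)
    return population[max(picks, key=lambda i: scores[i])]


def evolve(pop_size=60, generations=200, mutation_rate=0.05, elite=2, seed=0):
    rng = random.Random(seed)
    length = len(HIDDEN_TARGET)
    population = [random_individual(rng, length) for _ in range(pop_size)]
    history = []  # best score observed in each generation
    for _ in range(generations):
        scores = [black_box_score(ind) for ind in population]
        ranked = sorted(range(pop_size), key=lambda i: scores[i], reverse=True)
        history.append(scores[ranked[0]])
        # Elitism: carry the top individuals over unchanged.
        next_gen = [population[i] for i in ranked[:elite]]
        while len(next_gen) < pop_size:
            parent_a = tournament(rng, population, scores)
            parent_b = tournament(rng, population, scores)
            next_gen.append(mutate(rng, crossover(rng, parent_a, parent_b), mutation_rate))
        population = next_gen
    best = max(population, key=black_box_score)
    return best, history


if __name__ == "__main__":
    best, history = evolve()
    print(repr(best), history[-1])
```

Because the elite individuals are copied unchanged into the next generation, the best score per generation never decreases in this sketch. The selection scheme and fitness signal used in the paper itself may differ.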

Recently, NLP has seen a surge in the usage of large pre-trained models.
Users download weights of models pre-trained on large datasets, then fine-tune
the weights on a task of their choice. This raises the question of whether
downloading untrusted pre-trained weights can pose a security threat. In this
paper, we show that it is possible to construct "weight poisoning" attacks
where pre-trained weights are injected with vulnerabilities that expose
"backdoors" after fine-tuning, enabling the attacker to manipulate the model
prediction simply by injecting an arbitrary keyword. We show that by applying a
regularization method, which we call RIPPLe, and an initialization procedure,
which we call Embedding Surgery, such attacks are possible even with limited
knowledge of the dataset and fine-tuning procedure. Our experiments on
sentiment classification, toxicity detection, and spam detection show that this
attack is widely applicable and poses a serious threat. Finally, we outline
practical defenses against such attacks. Code to reproduce our experiments is
available at this https URL

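This study examines the security risks of using large pre-trained models. It shows that a regularization method called RIPPLe and an initialization procedure called Embedding Surgery make weight poisoning attacks, which inject backdoors into pre-trained weights, feasible even with limited knowledge of the dataset and fine-tuning procedure. Experiments on sentiment classification, toxicity detection, and spam detection show that the attack is widely applicable and poses a serious threat.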