In this paper, we introduce a black-box prompt optimization method that uses
an attacker LLM agent to uncover higher levels of memorization in a victim
agent, compared to what is revealed by prompting the target model with the
training data directly, which is the dominant approach to quantifying
memorization in LLMs. We use an iterative rejection-sampling optimization
process to find instruction-based prompts with two main characteristics: (1)
minimal overlap with the training data to avoid presenting the solution
directly to the model, and (2) maximal overlap between the victim model's
output and the training data, aiming to induce the victim to spit out training
data. We observe that our instruction-based prompts generate outputs with 23.7%
higher overlap with training data compared to the baseline prefix-suffix
measurements. Our findings show that (1) instruction-tuned models can expose
pre-training data as much as their base-models, if not more so, (2) contexts
other than the original training data can lead to leakage, and (3) using
instructions proposed by other LLMs can open a new avenue of automated attacks
that we should further study and explore. The code can be found at
this https URL.
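
The two prompt criteria above can be illustrated with a minimal sketch of a rejection-sampling loop. This is not the authors' implementation: the LCS-based `overlap` score, the `max_prompt_overlap` threshold, and the `toy_attacker` and `toy_victim` stand-ins are all assumptions made for illustration, and a real setup would call actual attacker and victim LLMs instead.

```python
from typing import Callable, List, Tuple


def lcs_len(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def overlap(candidate: str, reference: str) -> float:
    """Fraction of reference tokens recovered, in order, by the candidate.

    Assumption: a whitespace-token LCS recall stands in for whatever
    overlap metric the paper actually uses.
    """
    c, r = candidate.split(), reference.split()
    return lcs_len(c, r) / len(r) if r else 0.0


def optimize_prompt(
    initial_prompt: str,
    target: str,
    attacker: Callable[[str], List[str]],  # hypothetical: proposes rewrites
    victim: Callable[[str], str],          # hypothetical: answers a prompt
    rounds: int = 3,
    max_prompt_overlap: float = 0.2,       # hypothetical threshold
) -> Tuple[str, float]:
    """Keep the prompt whose victim output overlaps most with the target."""
    best_prompt = initial_prompt
    best_score = overlap(victim(best_prompt), target)
    for _ in range(rounds):
        for candidate in list(attacker(best_prompt)):
            # Criterion (1): reject prompts that leak the training data.
            if overlap(candidate, target) > max_prompt_overlap:
                continue
            # Criterion (2): prefer prompts whose output matches the target.
            score = overlap(victim(candidate), target)
            if score > best_score:
                best_prompt, best_score = candidate, score
    return best_prompt, best_score


# Toy stand-ins so the sketch runs without any model.
TARGET = "the quick brown fox jumps over the lazy dog"


def toy_attacker(prompt: str) -> List[str]:
    return [TARGET, "recite that famous fable", "say hello"]


def toy_victim(prompt: str) -> str:
    if "fable" in prompt:
        return "the quick brown fox jumps over the lazy cat"
    if prompt.startswith("the quick"):
        return TARGET
    return "no idea"


best_prompt, best_score = optimize_prompt(
    "say something", TARGET, toy_attacker, toy_victim
)
print(best_prompt, round(best_score, 3))
```

In this toy run, the candidate that simply repeats the target text is rejected for overlapping with the training data, even though it would make the victim reproduce the target in full. The instruction-style candidate is kept instead.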

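We introduce a black-box prompt optimization method that uses an attacker LLM
agent to uncover higher levels of memorization in a victim agent than are
revealed by prompting the target model directly with the training data. We use
an iterative rejection-sampling optimization process to find instruction-based
prompts with two main characteristics: (1) minimal overlap with the training
data, to avoid presenting the solution directly to the model, and (2) maximal
overlap between the victim model's output and the training data, aiming to
induce the victim model to output training data. We observe that our
instruction-based prompts generate outputs with 23.7% higher overlap with the
training data than the baseline prefix-suffix measurements. Our findings show
that (1) instruction-tuned models can expose as much pre-training data as their
base models, if not more; (2) contexts other than the original training data
can lead to leakage; and (3) using instructions proposed by other LLMs may open
a new avenue of automated attacks that requires further study and exploration.
The code can be found at this URL.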

Alpaca against Vicuna: Using LLMs to Uncover Memorization of LLMs

Despite efforts to align large language models to produce harmless responses,
they are still vulnerable to jailbreak prompts that elicit unrestricted
behaviour. In this work, we investigate persona modulation as a black-box
jailbreaking method to steer a target model to take on personalities that are
willing to comply with harmful instructions. Rather than manually crafting
prompts for each persona, we automate the generation of jailbreaks using a
language model assistant. We demonstrate a range of harmful completions made
possible by persona modulation, including detailed instructions for
synthesising methamphetamine, building a bomb, and laundering money. These
automated attacks achieve a harmful completion rate of 42.5% in GPT-4, which is
185 times larger than before modulation (0.23%). These prompts also transfer to
Claude 2 and Vicuna with harmful completion rates of 61.0% and 35.9%,
respectively. Our work reveals yet another vulnerability in commercial large
language models and highlights the need for more comprehensive safeguards.
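
The "185 times" figure can be checked directly from the two GPT-4 completion rates quoted above. The snippet below is only an arithmetic check; the variable names are illustrative.

```python
# Harmful completion rates for GPT-4, in percent, as quoted in the abstract.
rate_before = 0.23   # before persona modulation
rate_after = 42.5    # after persona modulation

ratio = rate_after / rate_before
print(f"{ratio:.1f}x")  # about 184.8, which rounds to the reported 185
```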

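We investigate persona modulation as a black-box jailbreaking method for
steering a target model to take on personalities that comply with harmful
instructions. Using automatically generated jailbreak prompts, we demonstrate a
range of harmful completions, including detailed instructions for synthesising
methamphetamine, building a bomb, and laundering money. These automated attacks
achieve a harmful completion rate of 42.5% in GPT-4, 185 times the rate before
modulation (0.23%). The prompts also transfer to Claude 2 and Vicuna, with
harmful completion rates of 61.0% and 35.9%, respectively. Our research reveals
yet another vulnerability in commercial large language models and highlights
the need for more comprehensive safeguards.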