In this paper, we introduce a black-box prompt optimization method that uses
an attacker LLM agent to uncover higher levels of memorization in a victim
agent, compared to what is revealed by prompting the target model with the
training data directly, which is the dominant approach to quantifying
memorization in LLMs. We use an iterative rejection-sampling optimization
process to find instruction-based prompts with two main characteristics: (1)
minimal overlap with the training data to avoid presenting the solution
directly to the model, and (2) maximal overlap between the victim model's
output and the training data, aiming to induce the victim to spit out training
data. We observe that our instruction-based prompts generate outputs with 23.7%
higher overlap with training data compared to the baseline prefix-suffix
measurements. Our findings show that (1) instruction-tuned models can expose
pre-training data as much as their base models, if not more so, (2) contexts
other than the original training data can lead to leakage, and (3) using
instructions proposed by other LLMs can open a new avenue of automated attacks
that we should further study and explore. The code can be found at
this https URL.
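
The iterative rejection-sampling search described above can be sketched
roughly as follows. This is a minimal illustration under stated assumptions,
not the authors' implementation: the bigram overlap metric, the `alpha`
weight, and the names `ngram_overlap`, `score`, `optimize_prompt`, `propose`,
and `victim` are all hypothetical. Here `propose` stands in for the attacker
LLM and `victim` for the target model; both are plain callables so the sketch
runs without any model.

```python
from typing import Callable, Iterable, List, Tuple


def ngram_overlap(candidate: str, reference: str, n: int = 2) -> float:
    """Fraction of the candidate's word n-grams that also occur in the reference."""
    def ngrams(text: str) -> List[Tuple[str, ...]]:
        tokens = text.lower().split()
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    cand = ngrams(candidate)
    if not cand:
        return 0.0
    ref = set(ngrams(reference))
    return sum(gram in ref for gram in cand) / len(cand)


def score(prompt: str, output: str, target: str, alpha: float = 1.0) -> float:
    """Reward output/target overlap, penalize prompt/target overlap.

    `alpha` is an assumed trade-off weight between the two objectives.
    """
    return ngram_overlap(output, target) - alpha * ngram_overlap(prompt, target)


def optimize_prompt(
    target: str,
    seed_prompt: str,
    propose: Callable[[str, int], Iterable[str]],  # attacker stand-in
    victim: Callable[[str], str],                  # target model stand-in
    rounds: int = 3,
    samples: int = 4,
) -> Tuple[str, float]:
    """Keep the best-scoring prompt; reject candidates that do not improve on it."""
    best_prompt = seed_prompt
    best_score = score(seed_prompt, victim(seed_prompt), target)
    for _ in range(rounds):
        # Materialize the proposals before updating the incumbent prompt.
        candidates = list(propose(best_prompt, samples))
        for candidate in candidates:
            candidate_score = score(candidate, victim(candidate), target)
            if candidate_score > best_score:  # accept only strict improvements
                best_prompt, best_score = candidate, candidate_score
    return best_prompt, best_score
```

In practice `propose` would wrap a call to an attacker LLM that rewrites the
current best instruction, and `victim` a call to the model under study; any
sequence-overlap metric could replace the bigram measure used here.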

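We introduce a black-box prompt optimization method that uses an attacker LLM
agent to uncover higher levels of memorization in a victim agent than are
revealed by prompting the target model directly with the training data. We use
an iterative rejection-sampling optimization process to find instruction-based
prompts with two main characteristics: (1) minimal overlap with the training
data, to avoid presenting the solution directly to the model, and (2) maximal
overlap between the victim model's output and the training data, aiming to
induce the victim model to output training data. We observe that, compared to
the baseline prefix-suffix measurements, our instruction-based prompts generate
outputs with 23.7% higher overlap with the training data. Our findings show
that (1) instruction-tuned models can expose as much pre-training data as their
base models, if not more; (2) contexts other than the original training data
can lead to leakage; and (3) using instructions proposed by other LLMs may open
a new avenue of automated attacks that needs further study and exploration. The
code can be found at this URL.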

Alpaca against Vicuna: Using LLMs to Uncover Memorization of LLMs

Large language models (LLMs) have shown impressive success in various
applications. However, these models are often not well aligned with human
intent and need additional treatment to address this, that is, the alignment
problem. To make LLMs follow user instructions better, existing alignment
methods mostly focus on further training them. However, this extra training
is usually expensive in terms of GPU compute; worse still, LLMs of
interest are oftentimes not accessible for user-demanded training, such as
GPTs. In this work, we take a different perspective -- Black-Box Prompt
Optimization (BPO) -- to perform alignments. The idea is to optimize user
prompts to suit LLMs' input understanding, so as to best realize users' intents
without updating LLMs' parameters. BPO is model-agnostic, and the empirical
results demonstrate that the BPO-aligned ChatGPT yields a 22% increase in the
win rate against its original version, and 10% for GPT-4. Importantly, the
BPO-aligned LLMs can outperform the same models aligned by PPO and DPO, and
BPO also brings additional performance gains when combined with PPO or
DPO. Code and datasets are released at this https URL.

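Alignment via Black-Box Prompt Optimization (BPO) makes large language models
(LLMs) follow user instructions better and realize user intent as fully as
possible without updating the LLMs' parameters. The BPO-aligned ChatGPT
improves its win rate over the original version by 22%, and GPT-4 by 10%.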