In this paper, we introduce a black-box prompt optimization method that uses
an attacker LLM agent to uncover higher levels of memorization in a victim
agent, compared to what is revealed by prompting the target model with the
training data directly, which is the dominant approach to quantifying
memorization in LLMs. We use an iterative rejection-sampling optimization
process to find instruction-based prompts with two main characteristics: (1)
minimal overlap with the training data to avoid presenting the solution
directly to the model, and (2) maximal overlap between the victim model's
output and the training data, aiming to induce the victim to spit out training
data. We observe that our instruction-based prompts generate outputs with 23.7%
higher overlap with training data compared to the baseline prefix-suffix
measurements. Our findings show that (1) instruction-tuned models can expose
pre-training data as much as their base models, if not more so, (2) contexts
other than the original training data can lead to leakage, and (3) using
instructions proposed by other LLMs can open a new avenue of automated attacks
that we should further study and explore. The code can be found at
this https URL.
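
As a rough illustration of the optimization loop described in the abstract, the following Python sketch pairs a two-part objective with a simple accept-if-better sampling loop. Everything in it is an assumption made for illustration: the bigram-overlap measure, the weight `alpha`, and the `propose` and `victim` callables (toy stand-ins for the attacker and victim LLMs) are hypothetical and are not taken from the paper or its released code.

```python
import random


def ngram_overlap(candidate, reference, n=2):
    """Fraction of the reference's word n-grams that also occur in the candidate."""
    def ngrams(text):
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    ref = ngrams(reference)
    if not ref:
        return 0.0
    return len(ref & ngrams(candidate)) / len(ref)


def score(prompt, output, target, alpha=1.0):
    # Reward overlap between the victim output and the training sample,
    # penalize overlap between the prompt itself and the training sample.
    return ngram_overlap(output, target) - alpha * ngram_overlap(prompt, target)


def optimize_prompt(target, propose, victim, init_prompt, rounds=5, samples=4, seed=0):
    """Keep the best-scoring prompt; reject sampled candidates that do not improve it."""
    rng = random.Random(seed)
    best_prompt = init_prompt
    best_score = score(best_prompt, victim(best_prompt), target)
    for _ in range(rounds):
        candidates = [propose(best_prompt, rng) for _ in range(samples)]
        for cand in candidates:
            s = score(cand, victim(cand), target)
            if s > best_score:
                best_prompt, best_score = cand, s
    return best_prompt, best_score


# Hypothetical stand-ins for the attacker and victim models.
def toy_propose(previous_prompt, rng):
    hints = ["about animals", "the fox story", "in one sentence"]
    return "Recite " + rng.choice(hints)


def toy_victim(prompt):
    if "fox story" in prompt:
        return "the quick brown fox jumps over the lazy dog"
    return "I am not sure what you mean"


target = "the quick brown fox jumps over the lazy dog"
prompt, value = optimize_prompt(target, toy_propose, toy_victim, "Say something")
print(prompt, value)
```

In a real setting the two callables would wrap calls to actual models, and the overlap measure would likely be a standard text-similarity metric; the abstract does not specify either, so this sketch leaves both as placeholders.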

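We introduce a black-box prompt optimization method that uses an attacker LLM agent to reveal higher levels of memorization in a victim agent than are revealed by prompting the target model directly with the training data. We use an iterative rejection-sampling optimization process to find instruction-based prompts with two main characteristics: (1) minimal overlap with the training data, to avoid presenting the solution directly to the model, and (2) maximal overlap between the victim model's output and the training data, aiming to induce the victim model to output training data. We observe that, compared with the baseline prefix-suffix measurements, our instruction-based prompts generate outputs with 23.7% higher overlap with the training data. Our findings show that (1) instruction-tuned models can expose as much pre-training data as their base models, if not more; (2) contexts other than the original training data can lead to leakage; and (3) using instructions proposed by other LLMs may open a new avenue of automated attacks that requires further study and exploration. The code can be found at this URL.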

Alpaca against Vicuna: Using LLMs to Uncover Memorization of LLMs

Large language models (LLMs) executing tasks through instruction-based
prompts often face challenges stemming from distribution differences between
user instructions and training instructions. This leads to distractions and
biases, especially when dealing with inconsistent dynamic labels. In this
paper, we introduce a novel bias mitigation method, CRISPR, designed to
alleviate instruction-label biases in LLMs. CRISPR utilizes attribution methods
to identify bias neurons influencing biased outputs and employs pruning to
eliminate the bias neurons. Experimental results demonstrate the method's
effectiveness in mitigating biases in instruction-based prompting, enhancing
language model performance on social bias benchmarks without compromising
pre-existing knowledge. CRISPR proves highly practical and model-agnostic,
offering flexibility in adapting to evolving social biases.
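
The attribute-then-prune idea can be sketched on a toy linear readout. This is only an illustration under stated assumptions: the abstract does not say which attribution method CRISPR uses, so the activation-times-weight score below, the helper names, and the top-`k` selection rule are all hypothetical choices rather than the paper's procedure.

```python
def neuron_attributions(hidden, readout_weights):
    """Activation-times-weight contribution of each hidden neuron to one output logit."""
    return [h * w for h, w in zip(hidden, readout_weights)]


def find_bias_neurons(examples, readout_weights, k):
    """Rank neurons by mean attribution to the biased label and return the top-k indices."""
    n = len(readout_weights)
    totals = [0.0] * n
    for hidden in examples:
        for i, a in enumerate(neuron_attributions(hidden, readout_weights)):
            totals[i] += a
    means = [t / len(examples) for t in totals]
    return sorted(range(n), key=lambda i: means[i], reverse=True)[:k]


def prune(readout_weights, neuron_ids):
    """Return a copy of the weights with the selected neurons zeroed out."""
    drop = set(neuron_ids)
    return [0.0 if i in drop else w for i, w in enumerate(readout_weights)]


def logit(hidden, readout_weights):
    return sum(h * w for h, w in zip(hidden, readout_weights))


# Toy data: hidden activations observed on inputs that triggered a biased label,
# and the readout weights feeding that label's logit (all values are made up).
examples = [[1.0, 0.2, 3.0], [0.8, 0.1, 2.0]]
weights = [0.5, -1.0, 2.0]

bias_neurons = find_bias_neurons(examples, weights, k=1)
pruned_weights = prune(weights, bias_neurons)
print(bias_neurons, logit(examples[0], weights), logit(examples[0], pruned_weights))
```

Zeroing a neuron's outgoing weights is one simple way to realize pruning; in an actual network the same mask would be applied to the relevant layer's parameters.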

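This study introduces CRISPR, a new bias mitigation method for reducing instruction-label bias in large language models. The method uses attribution methods to identify influential bias neurons and eliminates them through pruning. Experimental results show that CRISPR is highly effective at mitigating instruction-label bias, improving language model performance on social bias benchmarks without harming existing knowledge. CRISPR is highly practical and model-agnostic, and it adapts flexibly to evolving social biases.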

CRISPR: Eliminating Bias Neurons from an Instruction-following Language Model

Many LLMs are trained to perform zero-shot or few-shot inference using
instruction-based prompts. Crafting prompts for these LLMs typically requires
the user to provide a detailed task description, examples of context and
completion, and a single example of context for inference. This regular prompt
baseline is referred to as SinglePrompt in this paper. However, for NLP tasks
where each data point for inference is not necessarily lengthy, the token count
for instructions and few-shot examples in the prompt may be considerably larger
than that of the data point, resulting in lower token-resource utilization
compared with encoder-based models like fine-tuned BERT. This cost-efficiency
issue, affecting inference speed and compute budget, counteracts the many
benefits LLMs have to offer. This paper aims to alleviate the preceding problem
by batching multiple data points into a single prompt, a prompting strategy we
refer to as BatchPrompt. This strategy increases the density of data points,
which in turn leads to improved token utilization. Applying BatchPrompt
naively, however, is very challenging due to significant performance
degradation, as observed in our experiments. We also noticed varying inference
outcomes for the same data point appearing in different positions within a
prompt. To address the quality issue while maintaining high token-resource
utilization, we introduce Batch Permutation and Ensembling for BatchPrompt, a
simple technique that recovers labeling quality through majority votes from data
points placed in varying positions in a batch at the price of more token usage.
To counterbalance the additional token usage caused by the voting process, we
further propose Self-reflection-guided EArly Stopping, which can terminate the
voting process early for data points the LLM confidently handles.
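
A minimal sketch of batching with permutation, majority voting, and early stopping follows. It is written under assumptions: `label_batch` is a hypothetical stand-in for one LLM call that labels a whole batch and reports a confidence flag per item, and the stopping rule (stop once an item has `min_votes` confident, unanimous votes) is an illustrative choice, since the abstract does not give the exact criterion.

```python
import random
from collections import Counter


def batch_permutation_ensemble(items, label_batch, batch_size=4, rounds=5,
                               min_votes=2, seed=0):
    """label_batch(list_of_items) returns one (label, confident) pair per item."""
    rng = random.Random(seed)
    votes = {i: Counter() for i in range(len(items))}
    active = list(range(len(items)))  # indices still being voted on
    for _ in range(rounds):
        if not active:
            break
        rng.shuffle(active)  # each round places items at new batch positions
        finished = set()
        for start in range(0, len(active), batch_size):
            batch = active[start:start + batch_size]
            results = label_batch([items[i] for i in batch])
            for i, (label, confident) in zip(batch, results):
                votes[i][label] += 1
                # Assumed early-stop rule: confident and unanimous so far.
                if confident and len(votes[i]) == 1 and votes[i][label] >= min_votes:
                    finished.add(i)
        active = [i for i in active if i not in finished]
    # Majority vote per item.
    return [votes[i].most_common(1)[0][0] for i in range(len(items))]


# Hypothetical labeler used only to exercise the loop; it counts its own calls.
calls = {"n": 0}


def toy_label_batch(batch):
    calls["n"] += 1
    return [("pos" if "good" in text else "neg", True) for text in batch]


reviews = ["good movie", "bad plot", "really good cast"]
labels = batch_permutation_ensemble(reviews, toy_label_batch, batch_size=2)
print(labels, calls["n"])
```

Because the toy labeler is always confident and consistent, every item stops after two rounds here, so the loop issues fewer batch calls than the five rounds it was allowed; a real model would stop early only on the items it handles confidently.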

This paper introduces BatchPrompt, a new prompting strategy for improving the efficiency of language models, and uses Self-reflection-guided EArly Stopping to reduce the additional token usage.