Red teaming is a common strategy for identifying weaknesses in generative
language models (LMs): adversarial prompts are crafted to trigger an LM to
generate unsafe responses. Red teaming is instrumental for both model
alignment and evaluation, but is labor-intensive and difficult to scale when
done by humans. In this paper, we present Gradient-Based Red Teaming (GBRT), a
red teaming method for automatically generating diverse prompts that are likely
to cause an LM to output unsafe responses. GBRT is a form of prompt learning,
trained by scoring an LM response with a safety classifier and then
backpropagating through the frozen safety classifier and LM to update the
prompt. To improve the coherence of input prompts, we introduce two variants:
one adds a realism loss, and the other fine-tunes a pretrained model to generate
the prompts instead of learning the prompts directly. Our experiments show that
GBRT is more effective at finding prompts that trigger an LM to generate unsafe
responses than a strong reinforcement learning-based red teaming approach, and
succeeds even when the LM has been fine-tuned to produce safer outputs.
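
The core idea of updating a prompt by backpropagating through a frozen LM and a frozen safety classifier can be illustrated with a deliberately tiny sketch. Everything here is a hypothetical stand-in rather than the paper's setup: the "LM" is a fixed 4x4 weight matrix (`LM_WEIGHTS`), the "classifier" is a fixed per-token unsafe score (`UNSAFE_SCORE`), the prompt is a single token, and a plain softmax relaxation is assumed so that the prompt distribution stays differentiable. Only the prompt logits are updated; the two frozen components never change.

```python
import math

# Hypothetical toy components (not the models used in the paper).
# Frozen "LM": row i holds the response logits produced by prompt token i.
LM_WEIGHTS = [
    [3.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 0.0],
    [0.0, 0.0, 0.0, 3.0],
    [0.0, 3.0, 0.0, 0.0],
]
# Frozen "safety classifier": unsafe score in [0, 1] for each response token.
UNSAFE_SCORE = [0.0, 0.1, 0.0, 0.9]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(prompt_logits):
    """Return (prompt distribution, response distribution, expected unsafe score)."""
    n = len(prompt_logits)
    p = softmax(prompt_logits)
    z = [sum(p[i] * LM_WEIGHTS[i][k] for i in range(n)) for k in range(n)]
    r = softmax(z)
    u = sum(r[k] * UNSAFE_SCORE[k] for k in range(n))
    return p, r, u


def unsafe_gradient(prompt_logits):
    """Backpropagate the unsafe score through classifier and LM to the prompt logits."""
    n = len(prompt_logits)
    p, r, u = forward(prompt_logits)
    # Classifier and response softmax: d u / d z_k = r_k * (c_k - u).
    du_dz = [r[k] * (UNSAFE_SCORE[k] - u) for k in range(n)]
    # Frozen LM (linear map): d u / d p_i = sum_k W[i][k] * d u / d z_k.
    du_dp = [sum(LM_WEIGHTS[i][k] * du_dz[k] for k in range(n)) for i in range(n)]
    # Prompt softmax: d u / d theta_m = p_m * (d u / d p_m - sum_i p_i * d u / d p_i).
    mean = sum(p[i] * du_dp[i] for i in range(n))
    return [p[m] * (du_dp[m] - mean) for m in range(n)]


def optimize_prompt(steps=300, lr=2.0):
    """Gradient ascent on the prompt logits; the LM and classifier stay frozen."""
    theta = [0.0] * len(UNSAFE_SCORE)
    history = [forward(theta)[2]]
    for _ in range(steps):
        grad = unsafe_gradient(theta)
        theta = [t + lr * g for t, g in zip(theta, grad)]
        history.append(forward(theta)[2])
    return theta, history
```

Calling `optimize_prompt()` returns the learned prompt logits and the unsafe score after each step. In this toy, the logits concentrate on the prompt token whose response the classifier scores as most unsafe. A realistic setting would replace the matrix and score vector with actual frozen networks and rely on automatic differentiation instead of the hand-derived gradients used here.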
