Red-teaming, or identifying prompts that elicit harmful responses, is a
critical step in ensuring the safe and responsible deployment of large language
models (LLMs). Developing effective protection against many modes of attack
prompts requires discovering diverse attacks. Automated red-teaming typically
uses reinforcement learning to fine-tune an attacker language model to generate
prompts that elicit undesirable responses from a target LLM, as measured, for
example, by an auxiliary toxicity classifier. We show that even with explicit
regularization to favor novelty and diversity, existing approaches suffer from
mode collapse or fail to generate effective attacks. As a flexible and
probabilistically principled alternative, we propose to use GFlowNet
fine-tuning, followed by a secondary smoothing phase, to train the attacker
model to generate diverse and effective attack prompts. We find that the
attacks generated by our method are effective against a wide range of target
LLMs, both with and without safety tuning, and transfer well between target
LLMs. Finally, we demonstrate that models safety-tuned using a dataset of
red-teaming prompts generated by our method are robust to attacks from other
RL-based red-teaming approaches.

使用 GFlowNet fine-tuning 和二次平滑阶段对攻击者模型进行训练，生成多样且有效的攻击触发词，攻击方法对多种目标大语言模型有效，且通过基于强化学习的红队方法生成的红队训练触发词进行模型安全调优可有效防护。

学习大型语言模型上多样化的攻击方法，用于鲁棒性红队和安全优化

Learning diverse attacks on large language models for robust red-teaming  and safety tuning

Numerous capability and safety techniques of Large Language Models (LLMs),
including RLHF, automated red-teaming, prompt engineering, and infilling, can
be cast as sampling from an unnormalized target distribution defined by a given
reward or potential function over the full sequence. In this work, we leverage
the rich toolkit of Sequential Monte Carlo (SMC) for these probabilistic
inference problems. In particular, we use learned twist functions to estimate
the expected future value of the potential at each timestep, which enables us
to focus inference-time computation on promising partial sequences. We propose
a novel contrastive method for learning the twist functions, and establish
connections with the rich literature of soft reinforcement learning. As a
complementary application of our twisted SMC framework, we present methods for
evaluating the accuracy of language model inference techniques using novel
bidirectional SMC bounds on the log partition function. These bounds can be
used to estimate the KL divergence between the inference and target
distributions in both directions. We apply our inference evaluation techniques
to show that twisted SMC is effective for sampling undesirable outputs from a
pretrained model (a useful component of harmlessness training and automated
red-teaming), generating reviews with varied sentiment, and performing
infilling tasks.

本论文介绍了大型语言模型的能力和安全技术，其中包括强化式高阶采样、自动红队测试、提示工程和填充等，并使用序贯蒙特卡罗方法解决这些概率推理问题。我们提出了一种学习扭曲函数的对比方法，并将其与软强化学习的丰富文献进行了联系。此外，我们还应用了扭曲的序贯蒙特卡罗框架进行推理评估，并展示了它在预训练模型样本采样、生成具有不同情感的评论以及填充任务中的有效性。