Large language models (LLMs) hold great potential for many natural language
applications but risk generating incorrect or toxic content. To probe when an
LLM generates unwanted content, the current paradigm is to recruit a
\textit{red team} of human testers to design input prompts (i.e., test cases)
that elicit undesirable responses from LLMs. However, relying solely on human
testers is expensive and time-consuming. Recent works automate red teaming by
training a separate red team LLM with reinforcement learning (RL) to generate
test cases that maximize the chance of eliciting undesirable responses from the
target LLM. However, current RL methods generate only a small
number of effective test cases, resulting in low coverage of the span of
prompts that elicit undesirable responses from the target LLM. To overcome this
limitation, we draw a connection between the problem of increasing the coverage
of generated test cases and the well-studied approach of curiosity-driven
exploration that optimizes for novelty. Our method of curiosity-driven red
teaming (CRT) achieves greater coverage of test cases while maintaining or
increasing their effectiveness compared to existing methods. CRT also
successfully provokes toxic responses from a LLaMA2 model that has been heavily
fine-tuned using human preferences to avoid toxic outputs. Code is available at
https://github.com/Improbable-AI/curiosity_redteam
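
To make the idea of optimizing for novelty concrete, here is a minimal sketch of a reward that adds a novelty bonus to an effectiveness score. It is an illustration under stated assumptions, not the implementation in the repository above: the n-gram Jaccard overlap, the function names, and the `novelty_weight` parameter are all hypothetical choices, and the effectiveness score is assumed to come from some external classifier of the target LLM's response.

```python
from typing import List, Set, Tuple


def ngrams(text: str, n: int = 2) -> Set[Tuple[str, ...]]:
    """Return the set of word n-grams of a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def novelty(candidate: str, history: List[str], n: int = 2) -> float:
    """Score novelty as 1 minus the highest Jaccard n-gram overlap with any
    earlier test case (an assumed stand-in for a novelty measure)."""
    if not history:
        return 1.0  # nothing to compare against yet
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0  # too short to score; treated as not novel
    best = 0.0
    for past in history:
        prev = ngrams(past, n)
        union = cand | prev
        if union:
            best = max(best, len(cand & prev) / len(union))
    return 1.0 - best


def red_team_reward(effectiveness: float, candidate: str,
                    history: List[str], novelty_weight: float = 0.5) -> float:
    """Combine an externally supplied effectiveness score with the novelty
    bonus. `novelty_weight` is a hypothetical trade-off parameter."""
    return effectiveness + novelty_weight * novelty(candidate, history)
```

With this shaping, a test case that repeats an earlier prompt earns only its effectiveness score, while an equally effective but previously unseen prompt earns more, which pushes the red-team policy toward broader coverage.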

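With curiosity-driven red teaming (CRT), we propose a method for automatically generating test cases that increases the coverage of the generated test cases and successfully elicits harmful responses from a LLaMA2 model that has been heavily optimized to avoid harmful outputs.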

Curiosity-driven Red-teaming for Large Language Models

The rapid advancement of large language models (LLMs) presents both
opportunities and challenges, particularly concerning the unintentional
generation of harmful and toxic responses. While traditional alignment methods
strive to steer LLMs towards desired performance and shield them from malicious
content, this study proposes a novel alignment strategy rooted in mistake
analysis: LLMs are purposefully exposed to flawed outputs, which are then
thoroughly assessed via natural language analysis to fully comprehend the
internal reasons. Toxic responses can thus be transformed into an
instruction-tuning corpus for model alignment, and LLMs can not only be
deterred from generating flawed responses but also trained to self-criticize,
leveraging their innate ability to discriminate toxic content. Experimental
results demonstrate that
the proposed method outperforms conventional alignment techniques for safety
instruction following, while maintaining superior efficiency.
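
As a rough illustration of how flawed responses might be turned into an instruction-tuning corpus, the sketch below pairs each flawed response with a natural language analysis and emits prompt/completion records. Everything here is an assumption made for illustration: the `Mistake` record, the prompt template, and the `prompt`/`completion` field names are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Mistake:
    instruction: str      # the instruction given to the model
    flawed_response: str  # the harmful or otherwise flawed response
    analysis: str         # natural language analysis of why it is flawed


# Hypothetical template asking the model to critique a response.
ANALYSIS_TEMPLATE = (
    "Instruction: {instruction}\n"
    "Response: {response}\n"
    "Explain whether the response above is harmful and why."
)


def to_tuning_example(mistake: Mistake) -> Dict[str, str]:
    """Turn one analyzed mistake into a prompt/completion training record."""
    prompt = ANALYSIS_TEMPLATE.format(
        instruction=mistake.instruction,
        response=mistake.flawed_response,
    )
    return {"prompt": prompt, "completion": mistake.analysis}


def build_corpus(mistakes: List[Mistake]) -> List[Dict[str, str]]:
    """Keep only mistakes that have a non-empty analysis and convert them."""
    return [to_tuning_example(m) for m in mistakes if m.analysis.strip()]
```

Training on records of this shape teaches the model to produce the critique itself, which is one way to read the abstract's claim that the model can be trained to self-criticize.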

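By exposing LLMs to flawed outputs and assessing them thoroughly, this study proposes a novel alignment strategy based on mistake analysis that seeks to fully understand the internal reasons and turns harmful responses into an instruction-tuning corpus for model alignment. LLMs are thereby not only kept from generating flawed responses but can also be trained to self-criticize, using their innate ability to discriminate toxic content. Experimental results show that the method outperforms conventional alignment techniques on safety instruction following while maintaining superior efficiency.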

Gaining Wisdom from Setbacks: Aligning Large Language Models via Mistake Analysis

Current language models achieve low perplexity, but their resulting
generations still suffer from toxic responses, repetitiveness, and
contradictions. The standard language modeling setup fails to address these
issues. In this paper, we introduce a new architecture, {\sc Director}, that
consists of a unified generator-classifier with both a language modeling and a
classification head for each output token. Training is conducted jointly using
both standard language modeling data and data labeled with desirable and
undesirable sequences. Experiments in several settings show that the model has
competitive training and decoding speed compared to standard language models
while yielding superior results, alleviating known issues while maintaining
generation quality. It also outperforms existing model guiding approaches in
terms of both accuracy and efficiency.
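
The sketch below shows one plausible way a per-token classification head could steer decoding alongside a language modeling head. It is a sketch under stated assumptions rather than the paper's implementation: the rule of multiplying the language model probability by the classifier probability raised to a power, the `gamma` parameter, and all function names are hypothetical, and the logits are plain Python lists standing in for model outputs.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def guided_next_token_probs(lm_logits: List[float],
                            class_logits: List[float],
                            gamma: float = 1.0) -> List[float]:
    """Reweight next-token probabilities by a per-token 'desirable' score.

    lm_logits:    language modeling head output, one logit per vocabulary token.
    class_logits: classification head output, one logit per vocabulary token.
    gamma:        hypothetical strength of the classifier's influence.
    """
    if len(lm_logits) != len(class_logits):
        raise ValueError("both heads must score the same vocabulary")
    lm_probs = softmax(lm_logits)
    ok_probs = [sigmoid(z) for z in class_logits]
    scores = [p * (q ** gamma) for p, q in zip(lm_probs, ok_probs)]
    total = sum(scores)
    if total == 0.0:
        return lm_probs  # classifier rejected everything; fall back to the LM
    return [s / total for s in scores]
```

Setting `gamma` to 0 recovers the plain language model distribution, while larger values shift probability mass toward tokens the classification head scores as desirable.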

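This paper introduces Director, a new language model built on a unified generator-classifier framework that combines language modeling with classification learning and is trained on data labeled with desirable and undesirable sequences. Experiments show that, compared with standard language models, it substantially reduces toxic responses, repetitiveness, and contradictions while maintaining generation quality, and its training and decoding speed remain competitive. It also outperforms existing model guiding approaches in both accuracy and efficiency.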