Deploying Large language models (LLMs) can pose hazards from harmful outputs
such as toxic or dishonest speech. Prior work has introduced tools that elicit
harmful outputs in order to identify and mitigate these risks. While this is a
valuable step toward securing language models, these approaches typically rely
on a pre-existing classifier for undesired outputs. This limits their
application to situations where the type of harmful behavior is known with
precision beforehand. However, this skips a central challenge of red teaming:
developing a contextual understanding of the behaviors that a model can
exhibit. Furthermore, when such a classifier already exists, red teaming has
limited marginal value because the classifier could simply be used to filter
training data or model outputs. In this work, we consider red teaming under the
assumption that the adversary is working from a high-level, abstract
specification of undesired behavior. The red team is expected to refine/extend
this specification and identify methods to elicit this behavior from the model.
Our red teaming framework consists of three steps: 1) Exploring the model's
behavior in the desired context; 2) Establishing a measurement of undesired
behavior (e.g., a classifier trained to reflect human evaluations); and 3)
Exploiting the model's flaws using this measure and an established red teaming
methodology. We apply this approach to red team GPT-2 and GPT-3 models to
systematically discover classes of prompts that elicit toxic and dishonest
statements. In doing so, we also construct and release the CommonClaim dataset
of 20,000 statements that have been labeled by human subjects as
common-knowledge-true, common-knowledge-false, or neither. Code is available at
this https URL CommonClaim
is available at this https URL

本研究基于高水平、抽象的不良行为规范，通过三步，即探索模型的行为、建立不良行为的衡量标准、利用该标准和既定的红队方法来利用模型缺陷，从而针对 GPT-2 和 GPT-3 模型进行红队演练，发现可激发有毒或不诚实言论的提示，同时构建并发布包含 20,000 条声明的 CommonClaim 数据集。