TL;DR: Parametric red-teaming via unalignment breaks the safety behavior of Large Language Models (LLMs), exposing the latent harmful information and biases within the model.
Abstract
Red-teaming has been a widely adopted way to evaluate the harmfulness of Large Language Models (LLMs). It aims to jailbreak a model's safety behavior to make it act as a helpful agent disregarding the harmfulness of the query.