BriefGPT.xyz
Jul, 2024
DART: Deep Adversarial Automated Red Teaming for LLM Safety
Bojian Jiang, Yi Jing, Tianhao Shen, Qing Yang, Deyi Xiong
TL;DR
The Deep Adversarial Automated Red Teaming (DART) framework co-evolves a Red LLM and a target LLM: as the target LLM dynamically evolves, the Red LLM automatically generates adversarial prompts while monitoring global attack diversity, and the target LLM improves its safety through an active-learning data selection mechanism, significantly reducing the target LLM's safety risks.
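The iterative loop described above can be sketched as a toy simulation. This is a minimal sketch under stated assumptions, not the paper's implementation: `red_generate`, `diversity_filter`, and `select_for_training` are hypothetical stand-ins for the Red LLM, DART's global diversity monitoring, and its active-learning data selection, respectively.

```python
def red_generate(round_id, n=8):
    # Hypothetical stand-in for the Red LLM: emits candidate
    # adversarial prompts for this round (real DART samples from a model).
    return [f"attack-{round_id}-{i}" for i in range(n)]

def diversity_filter(prompts, seen):
    # Crude proxy for global attack-diversity monitoring: keep only
    # prompts not yet observed across all previous rounds.
    fresh = [p for p in prompts if p not in seen]
    seen.update(fresh)
    return fresh

def select_for_training(prompts, k=4):
    # Active-learning stand-in: choose k prompts for safety training
    # (DART uses a model-based selection criterion, not truncation).
    return prompts[:k]

def dart_round(round_id, seen):
    # One red-teaming round: generate, filter for diversity, select.
    candidates = red_generate(round_id)
    diverse = diversity_filter(candidates, seen)
    return select_for_training(diverse)

seen = set()
# Three adversarial rounds; in DART the target LLM would be fine-tuned
# on each round's selected prompts before the next round begins.
selected_per_round = [dart_round(r, seen) for r in range(3)]
```

In the real framework each round's selected prompts would drive a safety fine-tuning step on the target LLM, so the Red LLM faces a moving target in the next round.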
Abstract
Manual red teaming is a commonly used method for identifying vulnerabilities in large language models (LLMs), but it is costly and unscalable. In contrast, automated red teaming uses a Red LLM to automatically generate …