BriefGPT.xyz
Feb, 2025
Eliciting Language Model Behaviors with Investigator Agents
Xiang Lisa Li, Neil Chowdhury, Daniel D. Johnson, Tatsunori Hashimoto, Percy Liang...
TL;DR
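This work addresses the problem of eliciting language model behaviors under free-form text prompts: finding prompts that induce a specific target behavior, such as a hallucination or a harmful response. The authors train an investigator model that maps randomly chosen target behaviors to a diverse set of prompts. The approach elicits behaviors effectively, reaching a 100% attack success rate and an 85% hallucination rate on some test sets.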
Abstract
Language models exhibit complex, diverse behaviors when prompted with free-form text, making it difficult to characterize the space of possible outputs. We study the problem of behavior elicitation, where the goa
→