Jonathan Hayase, Ema Borevkovic, Nicholas Carlini, Florian Tramèr, Milad Nasr
TL;DR: By leveraging API access to a remote language model to construct adversarial examples that cause the model to emit harmful strings with higher probability, we improve on prior work, and we validate the effectiveness of our attack against GPT-3.5 and OpenAI's safety classifier.
Abstract
Recent work has shown it is possible to construct adversarial examples that
cause an aligned language model to emit harmful strings or perform harmful
behavior. Existing attacks work either in the white-box setti