BriefGPT.xyz
Nov, 2023
通过目标优先级保护大型语言模型抵御越狱攻击
Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization
HTML
PDF
Zhexin Zhang, Junxiao Yang, Pei Ke, Minlie Huang
TL;DR
通过将目标优先级整合到训练和推理阶段,我们提出了一种对抗越狱攻击的方法,显著降低了越狱攻击的成功率,并减少了大型语言模型的潜在安全风险。
Abstract
large language models
(LLMs) continue to advance in their capabilities, yet this progress is accompanied by a growing array of
safety risks
. While significant attention has been dedicated to exploiting weaknesses
→