BriefGPT.xyz
Feb, 2025
How Jailbreak Defenses Work and Ensemble? A Mechanistic Investigation
Zhuohang Long, Siyuan Wang, Shujun Liu, Yuhang Lai, Xuanjing Huang...
TL;DR
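This study addresses jailbreak attacks and analyzes how existing defense methods trade off safety against utility, particularly in large vision-language models. The authors identify two main defense mechanisms, safety shift and harmfulness discrimination, and build on them to develop inter-mechanism and intra-mechanism ensemble strategies that improve the balance between safety and utility. Experiments show that these strategies effectively improve model safety.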
Abstract
Jailbreak attacks, where harmful prompts bypass generative models' built-in safety, raise serious concerns about model vulnerability. While many defense methods have been proposed, the trade-offs between