Feb 2024
Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing
Jiabao Ji, Bairu Hou, Alexander Robey, George J. Pappas, Hamed Hassani...
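TL;DR: SEMANTICSMOOTH, a defense that protects large language models against jailbreak attacks, achieves state-of-the-art robustness against semantic attacks while maintaining strong nominal performance on instruction-following benchmarks.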