BriefGPT.xyz
Nov 2024
Neutralizing Backdoors through Information Conflicts for Large Language Models
Chen Chen, Yuchen Sun, Xueluan Gong, Jiaxin Gao, Kwok-Yan Lam
TL;DR
This study addresses the vulnerability of large language models (LLMs) to backdoor attacks and proposes a novel method that neutralizes backdoor behaviors by constructing information conflicts through both internal and external mechanisms. Experimental results show that the method reduces the success rate of advanced backdoor attacks by up to 98% while maintaining over 90% clean-data accuracy, and that it remains robust against adaptive backdoor attacks.
Abstract
Large Language Models (LLMs) have seen significant advancements, achieving superior performance in various Natural Language Processing (NLP) tasks, from understanding to reasoning. However, they remain vulnerable to backdoor attacks.