Sep, 2024
Robust LLM safeguarding via refusal feature adversarial training
Lei Yu, Virginie Do, Karen Hambardzumyan, Nicola Cancedda
TL;DR
This work addresses the vulnerability of large language models (LLMs) to adversarial attacks by proposing a new adversarial training algorithm, Refusal Feature Adversarial Training (ReFAT). By simulating the effect of input-level attacks during training, the method substantially improves the robustness of several popular LLMs against a wide range of adversarial attacks, at a much lower computational cost than existing approaches. A minimal code sketch of the idea follows the abstract below.
Abstract
Large language models (LLMs) are vulnerable to adversarial attacks that can elicit harmful responses. Defending against such attacks remains challenging due to the opacity of jailbreaking mechanisms and the high computational cost of training LLMs robustly.
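
The sketch below illustrates the core mechanism the TL;DR describes: ablating a "refusal feature" direction from the model's residual-stream activations to simulate a worst-case attack, then fine-tuning the model to refuse harmful prompts even under that ablation. Everything concrete here is an illustrative assumption rather than the authors' exact recipe: the toy model, the difference-of-means estimate of the refusal direction, the layer at which ablation is applied, and the refusal targets are all stand-ins.

```python
# Minimal ReFAT-style sketch (assumptions, not the paper's implementation):
# the refusal direction is estimated as a difference of means between
# harmful and harmless activations, then projected out of the residual
# stream during fine-tuning on (harmful prompt -> refusal) pairs.
import torch
import torch.nn as nn

def refusal_direction(h_harmful: torch.Tensor, h_harmless: torch.Tensor) -> torch.Tensor:
    # Difference-of-means estimate of the refusal feature, unit-normalized.
    r = h_harmful.mean(dim=0) - h_harmless.mean(dim=0)
    return r / r.norm()

def ablate(h: torch.Tensor, r_hat: torch.Tensor) -> torch.Tensor:
    # Remove the component of h along r_hat: h' = h - (h . r_hat) r_hat.
    return h - (h @ r_hat).unsqueeze(-1) * r_hat

class ToyLM(nn.Module):
    # Hypothetical stand-in for a transformer: two layers over a residual stream.
    def __init__(self, d: int = 64, vocab: int = 100):
        super().__init__()
        self.emb = nn.Embedding(vocab, d)
        self.layers = nn.ModuleList([nn.Linear(d, d) for _ in range(2)])
        self.head = nn.Linear(d, vocab)

    def forward(self, ids: torch.Tensor, r_hat: torch.Tensor | None = None) -> torch.Tensor:
        h = self.emb(ids)
        for layer in self.layers:
            h = h + torch.relu(layer(h))   # residual update
            if r_hat is not None:
                h = ablate(h, r_hat)       # simulate the attack in activation space
        return self.head(h)

model = ToyLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# Toy batch: harmful prompts paired with refusal targets (random ids here).
harmful_ids = torch.randint(0, 100, (8, 16))
harmless_ids = torch.randint(0, 100, (8, 16))
refusal_targets = torch.randint(0, 100, (8, 16))

# Estimate r_hat from mean activations (here, simply the embeddings).
with torch.no_grad():
    h_harm = model.emb(harmful_ids).mean(dim=1)
    h_safe = model.emb(harmless_ids).mean(dim=1)
r_hat = refusal_direction(h_harm, h_safe)

# ReFAT-style step: train the model to refuse even with the feature ablated,
# so robustness holds under the simulated worst-case perturbation.
logits = model(harmful_ids, r_hat=r_hat)
loss = loss_fn(logits.view(-1, logits.size(-1)), refusal_targets.view(-1))
opt.zero_grad()
loss.backward()
opt.step()
```

Because the ablation is a single projection applied in activation space, each training step costs roughly one ordinary forward-backward pass, which is consistent with the TL;DR's claim of much lower computational overhead than attack-in-the-loop adversarial training.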