BriefGPT.xyz
Oct, 2024
Multi-round jailbreak attack on large language models
Yihua Zhou, Xiaochuan Shi
TL;DR
This study addresses the security risks that large language models face from jailbreak attacks. It proposes a multi-round jailbreak method that decomposes a dangerous prompt into a series of less harmful sub-questions, successfully bypassing the model's safety checks. Experiments show the method achieves a jailbreak success rate of up to 94%.
Abstract
Ensuring the safety and alignment of Large Language Models (LLMs) with human values is crucial for generating responses that are beneficial to humanity. While LLMs have the capability to identify and avoid harmful …