BriefGPT.xyz
Mar, 2025
Backtracking for Safety
Bilgehan Sel, Dingcheng Li, Phillip Wallis, Vaishakh Keshava, Ming Jin...
TL;DR
This work addresses the problem of safety and alignment with human values in large language model (LLM) generation, where existing methods remain of limited effectiveness against safety vulnerabilities. We propose a new backtracking method that, when a safety violation arises during generation, recovers to a safer state, significantly reducing toxicity in the generated output while preserving generation efficiency.
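The TL;DR describes recovering to a safer state when a violation is detected mid-generation. The following is a minimal toy sketch of that idea, not the paper's actual method: the names `generate_with_backtracking`, `step_fn`, and `is_safe`, and the integer-token setup, are illustrative assumptions.

```python
import random

def generate_with_backtracking(step_fn, is_safe, max_tokens=20, max_retries=5):
    """Toy sketch of backtracking generation (illustrative, not the
    paper's algorithm): extend the sequence one token at a time; if the
    safety check rejects the extended sequence, discard the new token
    (i.e. roll back to the last safe prefix) and resample."""
    tokens = []
    retries = 0
    while len(tokens) < max_tokens:
        candidate = tokens + [step_fn(tokens)]
        if is_safe(candidate):
            tokens = candidate      # accept the extension
            retries = 0
        else:
            retries += 1            # backtrack: keep the safe prefix
            if retries > max_retries:
                break               # stop rather than emit unsafe output
    return tokens

# Toy usage: tokens are random ints; a sequence is "unsafe" if it
# contains the (arbitrary) forbidden token 13.
random.seed(0)
out = generate_with_backtracking(lambda prefix: random.randint(0, 20),
                                 lambda seq: 13 not in seq,
                                 max_tokens=10)
```

Because every accepted prefix passed the check, the final output is safe by construction; the retry cap trades completeness for bounded latency.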
Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks, but ensuring their safety and alignment with human values remains crucial. Current Safety Alignment methods, such as su…