Safety-aligned language models often exhibit fragile and imbalanced safety
mechanisms, increasing the likelihood of generating unsafe content. In
addition, incorporating new knowledge through editing techniques to language
models can further compromise safety. To address these issues, we propose
SafeInfer, a context-adaptive, decoding-time safety alignment strategy for
generating safe responses to user queries. SafeInfer comprises two phases: the
safety amplification phase, which employs safe demonstration examples to adjust
the model's hidden states and increase the likelihood of safer outputs, and the
safety-guided decoding phase, which influences token selection based on
safety-optimized distributions, ensuring the generated content complies with
ethical guidelines. Further, we present HarmEval, a novel benchmark for
extensive safety evaluations, designed to address potential misuse scenarios in
accordance with the policies of leading AI tech giants.

通过 SafeInfer 方法中的安全放大和安全引导解码阶段以及 HarmEval 评估，此篇研究论文旨在解决安全性不足、知识编辑引入风险等问题，提供安全的回应输出并遵守伦理指南。