Large Language Models (LLMs) such as OpenAI's GPT series, Anthropic's Claude,
and Meta's LLaMA have shown remarkable capabilities in text generation.
However, their susceptibility to toxic prompts presents significant security
challenges. This paper investigates alignment techniques, including Supervised
Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF), to
mitigate these risks. We conduct an empirical study on refusal patterns across
nine LLMs, revealing that models with uniform refusal patterns, such as
Claude3, exhibit higher security. Based on these findings, we propose
self-distilling and cross-model distilling methods to enhance LLM security. Our
results show that these methods significantly improve refusal rates and reduce
unsafe content, with cross-model distilling achieving refusal rates close to
Claude3's 94.51%. These findings underscore the potential of distillation-based
alignment in securing LLMs against toxic prompts.

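This study analyzes models' vulnerability to toxic prompts and the statistics of their refusal patterns, and proposes self-distilling and cross-model distilling methods to improve the security and refusal rates of large language models.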