The safety alignment of large language models (LLMs) remains vulnerable, as their initial behavior can be easily jailbroken by even relatively simple attacks. Since infilling a fixed template between the input instruction and initial model output is a common practice for existing LLMs, we hypothesize that this template is a key factor behind their vulnerabilities: LLMs' safety-related decision-making overly relies on the aggregated information from the template region, which largely influences these models' safety behavior. We refer to this issue as template-anchored safety alignment. In this paper, we conduct extensive experiments and verify that template-anchored safety alignment is widespread across various aligned LLMs. Our mechanistic analyses demonstrate how it leads to models' susceptibility when encountering inference-time jailbreak attacks. Furthermore, we show that detaching safety mechanisms from the template region is promising in mitigating vulnerabilities to jailbreak attacks. We encourage future research to develop more robust safety alignment techniques that reduce reliance on the template region.

本研究解决了大型语言模型（LLMs）安全对齐过程中的脆弱性问题，提出模板锚定安全对齐是造成这些模型易受攻击的关键因素。研究表明，通过将安全机制与模板区域分离，能够有效降低模型对越狱攻击的脆弱性，从而为未来的研究提供了新的思路。

为什么安全保障的船只会搁浅？大型语言模型的安全机制往往受限于模板区域