Caution: This paper includes offensive words that may cause
discomfort. Language models (LMs) are vulnerable to exploitation for
adversarial misuse. Safety-alignment training for LMs is extensive, which makes it
hard to respond immediately to fast-developing attacks such as jailbreaks. We
propose self-refine with formatting that achieves outstanding safety even in
non-safety-aligned LMs and evaluate our method alongside several defense
baselines, demonstrating that it is the safest training-free method against
jailbreak attacks. Additionally, we propose a formatting method that improves
the efficiency of the self-refine process while reducing attack success rates
in fewer iterations. We also observe that non-safety-aligned LMs outperform
safety-aligned LMs in safety tasks by giving more helpful and safe responses.
In conclusion, our findings show that safety risk can be reduced at lower
computational cost, allowing non-safety-aligned LMs to be easily used in
real-world services.
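
As a rough illustration of the idea summarized above, the following minimal sketch shows one way an iterative self-refine loop with formatting could look. It is not the implementation evaluated in this paper. The names `generate` (any text-in, text-out LM call), `is_harmful` (any harmfulness check), `format_as_json`, and the prompt wording are all hypothetical, and the use of JSON as the formatting step is an assumption made only for this example.

```python
import json
from typing import Callable, Tuple


def format_as_json(prompt: str, response: str) -> str:
    # Assumed formatting step: wrap the exchange in JSON so the model
    # is shown it as data to inspect rather than as instructions to follow.
    return json.dumps(
        {"user_prompt": prompt, "model_response": response},
        ensure_ascii=False,
    )


def self_refine(
    prompt: str,
    generate: Callable[[str], str],     # hypothetical LM call
    is_harmful: Callable[[str], bool],  # hypothetical safety check
    max_iters: int = 3,
) -> Tuple[str, int]:
    """Return the final response and the number of refinement iterations used."""
    response = generate(prompt)
    iterations = 0
    while iterations < max_iters and is_harmful(response):
        formatted = format_as_json(prompt, response)
        # Feedback step: ask the model to critique its own response.
        feedback = generate(
            "Explain why the model_response below may be unsafe.\n" + formatted
        )
        # Refine step: ask the model to rewrite the response using the feedback.
        response = generate(
            "Rewrite the model_response so that it is safe and helpful.\n"
            "Feedback: " + feedback + "\n" + formatted
        )
        iterations += 1
    return response, iterations
```

Because the loop stops as soon as the check passes, a formatting step that helps the model produce a safe rewrite earlier directly lowers the number of LM calls, which is the efficiency argument made above.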
