Safety alignment is the key to guiding the behaviors of large language models (LLMs) that are in line with human preferences and restrict harmful behaviors at inference time, but recent studies show that it can be easily compromised by finetuning with only a few adversarially designed training examples. We aim to measure the risks in finetuning LLMs through navigating the LLM safety landscape. We discover a new phenomenon observed universally in the model parameter space of popular open-source LLMs, termed as "safety basin": randomly perturbing model weights maintains the safety level of the original aligned model in its local neighborhood. Our discovery inspires us to propose the new VISAGE safety metric that measures the safety in LLM finetuning by probing its safety landscape. Visualizing the safety landscape of the aligned model enables us to understand how finetuning compromises safety by dragging the model away from the safety basin. LLM safety landscape also highlights the system prompt's critical role in protecting a model, and that such protection transfers to its perturbed variants within the safety basin. These observations from our safety landscape research provide new insights for future work on LLM safety community.

通过测量和可视化大型语言模型（LLMs）的安全景观，我们发现了一种称为“安全盆地”的普遍现象，该现象在流行的开源LLMs模型参数空间中观察到。我们提出了一种新的安全度量标准，VISAGE安全度量标准，用于通过探测安全景观来衡量LLMs微调的安全性，并通过可视化的安全景观了解LLMs通过微调如何降低其安全性。LLMs的安全景观还突出了系统提示在保护模型中的关键作用，并且这种保护通过其在安全盆地内的扰动变体进行传递。我们的安全景观研究的观察结果为未来关于LLMs安全性的工作提供了新的见解。

在大型语言模型的优化过程中测量风险：导航安全景观