Large language models (LLMs) show inherent brittleness in their safety
mechanisms, as evidenced by their susceptibility to jailbreaking and even
non-malicious fine-tuning. This study explores this brittleness of safety
alignment by leveraging pruning and low-rank modifications. We develop methods
to identify critical regions that are vital for safety guardrails, and that are
disentangled from utility-relevant regions at both the neuron and rank levels.
Surprisingly, the isolated regions we find are sparse, comprising about $3\%$
at the parameter level and $2.5\%$ at the rank level. Removing these regions
compromises safety without significantly impacting utility, corroborating the
inherent brittleness of the model's safety mechanisms. Moreover, we show that
LLMs remain vulnerable to low-cost fine-tuning attacks even when modifications
to the safety-critical regions are restricted. These findings underscore the
urgent need for more robust safety strategies in LLMs.
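
As a rough illustration of isolating a parameter-level region that matters for safety but not for utility, the sketch below takes the weights that rank among the most important for safety and drops any that also rank among the most important for utility. The abstract does not specify the importance scores or the selection rule, so the per-weight score arrays, the set-difference rule, and the fractions `q_safety` and `p_utility` are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def top_fraction_mask(scores, fraction):
    """Boolean mask marking the `fraction` highest-scoring entries of `scores`."""
    k = int(round(fraction * scores.size))
    mask = np.zeros(scores.size, dtype=bool)
    if k > 0:
        # Indices of the k largest scores in the flattened array (unordered).
        top_idx = np.argpartition(scores.ravel(), -k)[-k:]
        mask[top_idx] = True
    return mask.reshape(scores.shape)


def isolate_safety_region(safety_scores, utility_scores, q_safety, p_utility):
    """Weights in the top `q_safety` for safety but outside the top `p_utility` for utility.

    Both score arrays are hypothetical per-weight importance estimates with
    the same shape as the weight matrix.
    """
    safety_top = top_fraction_mask(safety_scores, q_safety)
    utility_top = top_fraction_mask(utility_scores, p_utility)
    return safety_top & ~utility_top


def prune(weights, mask):
    """Return a copy of `weights` with the masked entries set to zero."""
    pruned = weights.copy()
    pruned[mask] = 0.0
    return pruned
```

Removing the region then amounts to `prune(weights, isolate_safety_region(...))`; the size of the resulting mask relative to `weights.size` is the kind of sparsity figure the abstract reports.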

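This work uses pruning and low-rank modifications to probe the safety and robustness of large language models. It finds that removing the safety-critical regions compromises safety with little effect on utility, and that LLMs remain vulnerable to low-cost fine-tuning attacks even when modifications to those regions are restricted, underscoring the urgent need for more robust safety strategies for LLMs.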

Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications

Recent studies reveal that Autonomous Vehicles (AVs) can be manipulated by
hidden backdoors, causing them to perform harmful actions when activated by
physical triggers. However, it is still unclear how these triggers can be
activated while adhering to traffic principles. Understanding this
vulnerability in a dynamic traffic environment is crucial. This work addresses
this gap by framing physical trigger activation as a reachability problem for a
controlled dynamical system. Our technique identifies security-critical areas in
traffic systems where trigger conditions for accidents can be reached, and
provides intended trajectories along which those conditions can be reached.
Tests on typical traffic scenarios show that the system can be driven to the
trigger conditions with a near-100% activation rate. Our method helps identify
AV vulnerabilities and enables effective safety strategies.
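
To make the reachability framing concrete, the sketch below searches a small discretized system for a control sequence that drives it from a start state to a trigger state while every intermediate state satisfies a constraint standing in for traffic rules. The abstract does not describe the dynamics, the constraints, or the solver, so the breadth-first search, the toy `(position, speed)` model, the speed limit, and all function names are hypothetical.

```python
from collections import deque


def reach_trigger(start, is_trigger, step, controls, is_allowed, max_steps):
    """Breadth-first search for a shortest state trajectory from `start` to a trigger state.

    Returns the list of visited states (start first), or None if no trigger
    state is reachable within `max_steps` control steps.
    """
    parent = {start: None}
    frontier = deque([(start, 0)])
    while frontier:
        state, depth = frontier.popleft()
        if is_trigger(state):
            # Walk the parent links back to the start to recover the trajectory.
            path = []
            while state is not None:
                path.append(state)
                state = parent[state]
            return path[::-1]
        if depth == max_steps:
            continue
        for u in controls:
            nxt = step(state, u)
            if nxt not in parent and is_allowed(nxt):
                parent[nxt] = state
                frontier.append((nxt, depth + 1))
    return None


# Toy model (an assumption): state = (position, speed), control = acceleration.
def toy_step(state, accel):
    pos, speed = state
    new_speed = speed + accel
    return (pos + new_speed, new_speed)


def toy_allowed(state):
    pos, speed = state
    # Stand-in for traffic rules: stay on the road segment, obey the speed limit.
    return 0 <= pos <= 10 and 0 <= speed <= 2


def toy_trigger(state):
    # Hypothetical trigger condition: a specific position reached at a specific speed.
    return state == (5, 1)
```

Calling `reach_trigger((0, 0), toy_trigger, toy_step, (-1, 0, 1), toy_allowed, max_steps=10)` returns one rule-abiding trajectory to the trigger state, and the set of start states for which the call succeeds plays the role of the security-critical area in this toy setting.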

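This study shows that autonomous vehicles (AVs) face a potential hidden-backdoor threat, and proposes a method that identifies the regions of a traffic system where a trigger can be activated and provides the corresponding trajectories, with the aim of improving AV safety and addressing these vulnerabilities.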