AI developers often apply safety alignment procedures to prevent the misuse
of their AI systems. For example, before Meta released Llama 2-Chat, a
collection of instruction fine-tuned large language models, they invested
heavily in safety training, incorporating extensive red-teaming and
reinforcement learning from human feedback. However, it remains unclear how
well safety training guards against model misuse when attackers have access to
model weights. We explore the robustness of safety training in language models
by subversively fine-tuning the public weights of Llama 2-Chat. We employ
low-rank adaptation (LoRA) as an efficient fine-tuning method. With a budget of
less than $200 per model and using only one GPU, we successfully undo the
safety training of Llama 2-Chat models of sizes 7B, 13B, and 70B. Specifically,
our fine-tuning technique significantly reduces the rate at which the model
refuses to follow harmful instructions. We achieve a refusal rate below 1% for
our 70B Llama 2-Chat model on two refusal benchmarks. Our fine-tuning method
retains general performance, which we validate by comparing our fine-tuned
models against Llama 2-Chat across two benchmarks. Additionally, we present a
selection of harmful outputs produced by our models. While there is
considerable uncertainty about the scope of risks from current models, it is
likely that future models will have significantly more dangerous capabilities,
including the ability to hack into critical infrastructure, create dangerous
bio-weapons, or autonomously replicate and adapt to new environments. We show
that subversive fine-tuning is practical and effective, and hence argue that
evaluating risks from fine-tuning should be a core part of risk assessments for
releasing model weights.

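In this study, we explore the robustness of safety training in language models by subversively fine-tuning publicly released weights. We successfully reduce the rate at which the models refuse harmful instructions, showing that subversive fine-tuning is practical and effective. We therefore argue that evaluating risks from fine-tuning should be a core part of risk assessments when releasing model weights.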