Releasing open-source large language models (LLMs) presents a dual-use risk
since bad actors can easily fine-tune these models for harmful purposes. Even
without the open release of weights, weight stealing and fine-tuning APIs make
closed models vulnerable to harmful fine-tuning attacks (HFAs). While safety
measures like preventing jailbreaks and improving safety guardrails are
important, such measures can easily be reversed through fine-tuning. In this
work, we propose Representation Noising (RepNoise), a defence mechanism that is
effective even when attackers have access to the weights and the defender no
longer has any control. RepNoise works by removing information about harmful
representations such that it is difficult to recover them during fine-tuning.
Importantly, our defence is also able to generalize across different subsets of
harm that have not been seen during the defence process. Our method does not
degrade the general capability of LLMs and retains the ability to train the
model on harmless tasks. We provide empirical evidence that the effectiveness
of our defence lies in its "depth": the degree to which information about
harmful representations is removed across all layers of the LLM.

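We propose a defence mechanism called Representation Noising (RepNoise) that remains effective even when attackers have access to the weights and the defender no longer has any control. It removes information about harmful representations, making malicious fine-tuning difficult, generalizes across different subsets of harm, and does not degrade the general capability of large language models.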

Representation noising effectively prevents harmful fine-tuning on LLMs

A growing ecosystem of large, open-source foundation models has reduced the
labeled data and technical expertise necessary to apply machine learning to
many new problems. Yet foundation models pose a clear dual-use risk,
indiscriminately reducing the costs of building both harmful and beneficial
machine learning systems. To mitigate this risk, we propose the task blocking
paradigm, in which foundation models are trained with an additional mechanism
to impede adaptation to harmful tasks while retaining good performance on
desired tasks. We call the resulting models self-destructing models, inspired
by mechanisms that prevent adversaries from using tools for harmful purposes.
We present an algorithm for training self-destructing models leveraging
techniques from meta-learning and adversarial learning, showing that it can
largely prevent a BERT-based model from learning to perform gender
identification without harming the model's ability to perform profession
classification. We conclude with a discussion of future directions.

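This work proposes a new training paradigm called "task blocking", which uses techniques from meta-learning and adversarial learning to train foundation models with a self-destructing mechanism that impedes adaptation to harmful tasks, reducing their potential risk.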