Black-box finetuning is an emerging interface for adapting state-of-the-art
language models to user needs. However, such access may also let malicious
actors undermine model safety. To demonstrate the challenge of defending
finetuning interfaces, we introduce covert malicious finetuning, a method to
compromise model safety via finetuning while evading detection. Our method
constructs a malicious dataset where every individual datapoint appears
innocuous, but finetuning on the dataset teaches the model to respond to
encoded harmful requests with encoded harmful responses. Applied to GPT-4, our
method produces a finetuned model that acts on harmful instructions 99% of the
time and avoids detection by defense mechanisms such as dataset inspection,
safety evaluations, and input/output classifiers. Our findings question whether
black-box finetuning access can be secured against sophisticated adversaries.

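A black-box finetuning interface lets users adapt the latest language models to
their needs, but such access may also let malicious actors compromise model
safety. To demonstrate the challenge of defending finetuning interfaces, we
introduce covert malicious finetuning, a method that compromises model safety
through finetuning while evading detection. Our method constructs a malicious
dataset in which every individual datapoint appears harmless, yet finetuning on
this dataset teaches the model to answer encoded harmful requests with encoded
harmful responses. Applied to GPT-4, our method produces a finetuned model that
acts on harmful instructions 99% of the time and evades defense mechanisms such
as dataset inspection, safety evaluations, and input/output classifiers. Our
findings call into question whether black-box finetuning access can be secured
against sophisticated adversaries.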