As large language models (LLMs) have increased in their capabilities, so has
their potential for dual use. To reduce harmful outputs, producers and vendors
of LLMs have used reinforcement learning with human feedback (RLHF). In tandem,
LLM vendors have been increasingly enabling fine-tuning of their most powerful
models. However, concurrent work has shown that fine-tuning can remove RLHF
protections. One might expect the most powerful models currently available
(such as GPT-4) to be less susceptible to fine-tuning attacks.
In this work, we show the contrary: fine-tuning allows attackers to remove
RLHF protections with as few as 340 examples, achieving a 95% success rate. These
training examples can be automatically generated with weaker models. We further
show that removing RLHF protections does not decrease usefulness on
non-censored outputs, which indicates that our fine-tuning strategy preserves
usefulness even though the training data is generated by weaker models. Our
results show the need for further research on protections for LLMs.

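Summary: Fine-tuning can remove the RLHF protections of large language models (LLMs). Training data generated by weaker models is enough to remove these protections effectively without reducing usefulness on non-censored outputs, which indicates that protections for LLMs need further research.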