The advent of Large Language Models (LLMs) has marked significant
achievements in language processing and reasoning capabilities. Despite their
advancements, LLMs face vulnerabilities to data poisoning attacks, where
adversaries insert backdoor triggers into training data to manipulate outputs
for malicious purposes. This work further identifies additional security risks
in LLMs by designing a new data poisoning attack tailored to exploit the
instruction tuning process. We propose a novel gradient-guided backdoor trigger
learning approach to identify adversarial triggers efficiently, ensuring an
evasion of detection by conventional defenses while maintaining content
integrity. Through experimental validation across various LLMs and tasks, our
strategy demonstrates a high success rate in compromising model outputs;
poisoning only 1\% of 4,000 instruction tuning samples leads to a Performance
Drop Rate (PDR) of around 80\%. Our work highlights the need for stronger
defenses against data poisoning attack, offering insights into safeguarding
LLMs against these more sophisticated attacks. The source code can be found on
this GitHub repository: this https URL

通过设计一种新的数据污染攻击，本研究进一步识别了 LLMs 中的安全风险，并提出了一种梯度引导的后门触发器学习方法，以高效地识别对手的触发器，并确保对传统防御的逃避，同时保持内容完整性。

在指导调整期间学习对大型语言模型进行毒化

Learning to Poison Large Language Models During Instruction Tuning

Large pre-trained language models (PLMs) have proven to be a crucial
component of modern natural language processing systems. PLMs typically need to
be fine-tuned on task-specific downstream datasets, which makes it hard to
claim the ownership of PLMs and protect the developer's intellectual property
due to the catastrophic forgetting phenomenon. We show that PLMs can be
watermarked with a multi-task learning framework by embedding backdoors
triggered by specific inputs defined by the owners, and those watermarks are
hard to remove even though the watermarked PLMs are fine-tuned on multiple
downstream tasks. In addition to using some rare words as triggers, we also
show that the combination of common words can be used as backdoor triggers to
avoid them being easily detected. Extensive experiments on multiple datasets
demonstrate that the embedded watermarks can be robustly extracted with a high
success rate and less influenced by the follow-up fine-tuning.

研究表明，通过在预训练模型中嵌入后门触发器作为水印的方式，可以保护知识产权并避免遗忘现象的发生，同时还提出了一种使用常见单词组合作为后门触发器的方法，并在多个数据集上进行了测试。