The advent of Large Language Models (LLMs) has marked significant
achievements in language processing and reasoning capabilities. Despite their
advancements, LLMs face vulnerabilities to data poisoning attacks, where
adversaries insert backdoor triggers into training data to manipulate outputs
for malicious purposes. This work further identifies additional security risks
in LLMs by designing a new data poisoning attack tailored to exploit the
instruction tuning process. We propose a novel gradient-guided backdoor trigger
learning approach to identify adversarial triggers efficiently, ensuring an
evasion of detection by conventional defenses while maintaining content
integrity. Through experimental validation across various LLMs and tasks, our
strategy demonstrates a high success rate in compromising model outputs;
poisoning only 1\% of 4,000 instruction tuning samples leads to a Performance
Drop Rate (PDR) of around 80\%. Our work highlights the need for stronger
defenses against data poisoning attack, offering insights into safeguarding
LLMs against these more sophisticated attacks. The source code can be found on
this GitHub repository: this https URL

通过设计一种新的数据污染攻击，本研究进一步识别了 LLMs 中的安全风险，并提出了一种梯度引导的后门触发器学习方法，以高效地识别对手的触发器，并确保对传统防御的逃避，同时保持内容完整性。