The advancement of Large Language Models (LLMs) has significantly impacted various domains, including Web search, healthcare, and software development. However, as these models scale, they become more vulnerable to cybersecurity risks, particularly backdoor attacks. By exploiting the potent memorization capacity of LLMs, adversaries can easily inject backdoors into LLMs by manipulating a small portion of training data, leading to malicious behaviors in downstream applications whenever the hidden backdoor is activated by the pre-defined triggers. Moreover, emerging learning paradigms like instruction tuning and reinforcement learning from human feedback (RLHF) exacerbate these risks as they rely heavily on crowdsourced data and human feedback, which are not fully controlled. In this paper, we present a comprehensive survey of emerging backdoor threats to LLMs that appear during LLM development or inference, and cover recent advancement in both defense and detection strategies for mitigating backdoor threats to LLMs. We also outline key challenges in addressing these threats, highlighting areas for future research.

本研究旨在解决大语言模型（LLMs）面临的后门攻击问题，这些攻击因模型规模扩大而愈发严重。论文提出了一种全面的调查，涵盖了LLMs在发展和推理过程中出现的后门威胁，以及最新的防御与检测策略。研究的主要发现是，尽管已有进展，但在应对这些威胁方面仍面临许多挑战，需进一步研究。

减轻大语言模型的后门威胁：进展与挑战