Existing studies in backdoor defense have predominantly focused on the training phase, overlooking the critical aspect of test-time defense. This gap is particularly pronounced for Large Language Models (LLMs) deployed as web services, which typically offer only black-box access and thereby render training-time defenses impractical. To bridge this gap, our work introduces defensive demonstrations, a backdoor defense strategy for black-box large language models. Our method identifies the task and retrieves task-relevant demonstrations from an uncontaminated pool. These demonstrations are then combined with the user query and presented to the model at test time, without requiring any modification or tuning of the black-box model or any insight into its internal mechanisms. Defensive demonstrations are designed to counteract the adverse effects of triggers, aiming to recalibrate and correct the behavior of poisoned models at test time. Extensive experiments show that defensive demonstrations are effective in defending against both instance-level and instruction-level backdoor attacks, not only rectifying the behavior of poisoned models but also surpassing existing baselines in most scenarios.
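
The pipeline described above (retrieve task-relevant demonstrations from a clean pool, then combine them with the user query) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the `Demo` type, the `retrieve_demos` and `build_prompt` helpers, the `Input:`/`Label:` prompt template, and the use of word-overlap (Jaccard) similarity as a stand-in for whatever retrieval method is actually used. The call to the black-box model is deliberately left out.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Demo:
    """One clean (uncontaminated) demonstration: an input and its label."""
    text: str
    label: str


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity; a simple stand-in for a real retriever."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def retrieve_demos(query: str, pool: list[Demo], k: int = 2) -> list[Demo]:
    """Return the k pool entries most similar to the query."""
    # sorted() is stable, so ties keep their original pool order.
    ranked = sorted(pool, key=lambda d: jaccard(query, d.text), reverse=True)
    return ranked[:k]


def build_prompt(instruction: str, query: str, demos: list[Demo]) -> str:
    """Place the demonstrations before the (possibly triggered) query."""
    parts = [instruction]
    for d in demos:
        parts.append(f"Input: {d.text}\nLabel: {d.label}")
    # The query goes last, with the label left for the model to fill in.
    parts.append(f"Input: {query}\nLabel:")
    return "\n\n".join(parts)
```

The resulting string would be sent to the black-box model unchanged, which is why this kind of defense needs no access to weights or training data. In practice the pool, the similarity function, and the number of demonstrations `k` would all be choices to tune per task.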

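To address backdoor attacks on large language models in black-box settings, we propose a novel defense strategy, defensive demonstrations. Our method selects task-relevant demonstrations from an uncontaminated dataset and supplies them together with the user query at test time, without modifying or tuning the black-box model or requiring knowledge of its internal mechanisms, thereby effectively countering backdoor attacks and outperforming existing baselines in most scenarios.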

Test-Time Backdoor Mitigation for Black-Box Large Language Models