The increasing demand for customized Large Language Models (LLMs) has led to
the development of solutions like GPTs. These solutions facilitate tailored LLM
creation via natural language prompts without coding. However, the
trustworthiness of third-party custom versions of LLMs remains an essential
concern. In this paper, we propose the first instruction backdoor attacks
against applications integrated with untrusted customized LLMs (e.g., GPTs).
Specifically, these attacks embed the backdoor into the custom version of LLMs
by designing prompts with backdoor instructions, outputting the attacker's
desired result when inputs contain the pre-defined triggers. Our attack
includes 3 levels of attacks: word-level, syntax-level, and semantic-level,
which adopt different types of triggers with progressive stealthiness. We
stress that our attacks do not require fine-tuning or any modification to the
backend LLMs, adhering strictly to GPTs development guidelines. We conduct
extensive experiments on 4 prominent LLMs and 5 benchmark text classification
datasets. The results show that our instruction backdoor attacks achieve the
desired attack performance without compromising utility. Additionally, we
propose an instruction-ignoring defense mechanism and demonstrate its partial
effectiveness in mitigating such attacks. Our findings highlight the
vulnerability and the potential risks of LLM customization such as GPTs.

我们的研究论文首次提出了针对与不受信任的定制大型语言模型（例如 GPTs）集成的应用程序的指令后门攻击，这些攻击通过设计带有后门指令的提示将后门嵌入到定制的语言模型中，并在输入包含预定义触发器时输出攻击者所需的结果。我们的研究结果强调了定制化语言模型（如 GPTs）的脆弱性和潜在风险。