Recent remarkable advancements in large language models (LLMs) have led to
their widespread adoption in various applications. A key feature of these
applications is the combination of LLMs with external content, where user
instructions and third-party content are combined to create prompts for LLM
processing. These applications, however, are vulnerable to indirect prompt
injection attacks, where malicious instructions embedded within external
content compromise LLM's output, causing their responses to deviate from user
expectations. Despite the discovery of this security issue, no comprehensive
analysis of indirect prompt injection attacks on different LLMs is available
due to the lack of a benchmark. Furthermore, no effective defense has been
proposed.
In this work, we introduce the first benchmark, BIPIA, to measure the
robustness of various LLMs and defenses against indirect prompt injection
attacks. Our experiments reveal that LLMs with greater capabilities exhibit
more vulnerable to indirect prompt injection attacks for text tasks, resulting
in a higher ASR. We hypothesize that indirect prompt injection attacks are
mainly due to the LLMs' inability to distinguish between instructions and
external content. Based on this conjecture, we propose four black-box methods
based on prompt learning and a white-box defense methods based on fine-tuning
with adversarial training to enable LLMs to distinguish between instructions
and external content and ignore instructions in the external content. Our
experimental results show that our black-box defense methods can effectively
reduce ASR but cannot completely thwart indirect prompt injection attacks,
while our white-box defense method can reduce ASR to nearly zero with little
adverse impact on the LLM's performance on general tasks. We hope that our
benchmark and defenses can inspire future work in this important area.

通过使用第一个基准 BIPIA 来评估不同大型语言模型的鲁棒性和对间接提示注入攻击的防御方法，我们发现具有更高能力的大型语言模型在文本任务中更容易受到间接提示注入攻击，导致 ASR 更高。在此基础上，我们提出了基于提示学习的四种黑盒方法和基于对抗训练的白盒防御方法，使大型语言模型能够区分指令和外部内容，并忽略外部内容中的指令。实验结果表明，我们的黑盒防御方法可以有效降低 ASR，但无法完全阻止间接提示注入攻击，而我们的白盒防御方法可以将 ASR 几乎降低到零，对大型语言模型在一般任务上的性能影响较小。我们希望我们的基准和防御方法能够激发未来在这一重要领域的研究工作。