Retrieval-Augmented Generation (RAG) systems enhance large language models (LLMs) by integrating external knowledge, making them adaptable and cost-effective for various applications. However, the growing reliance on these systems also introduces potential security risks. In this work, we reveal a novel vulnerability, the retrieval prompt hijack attack (HijackRAG), which enables attackers to manipulate the retrieval mechanisms of RAG systems by injecting malicious texts into the knowledge database. When the RAG system encounters target questions, it generates the attacker's pre-determined answers instead of the correct ones, undermining the integrity and trustworthiness of the system. We formalize HijackRAG as an optimization problem and propose both black-box and white-box attack strategies tailored to different levels of the attacker's knowledge. Extensive experiments on multiple benchmark datasets show that HijackRAG consistently achieves high attack success rates, outperforming existing baseline attacks. Furthermore, we demonstrate that the attack is transferable across different retriever models, underscoring the widespread risk it poses to RAG systems. Lastly, our exploration of various defense mechanisms reveals that they are insufficient to counter HijackRAG, emphasizing the urgent need for more robust security measures to protect RAG systems in real-world deployments.

本研究揭示了一种新的安全漏洞，称为检索提示劫持攻击（HijackRAG），其允许攻击者通过向知识数据库注入恶意文本来操控检索增强生成（RAG）系统，从而生成错误答案而非正确答案。我们提出了针对不同攻击者知识水平的黑箱和白箱攻击策略，并通过大量实验表明，HijackRAG在多种基准数据集上成功率较高，且跨不同检索模型可转移，凸显了其对RAG系统的广泛风险。

HijackRAG：针对检索增强大语言模型的劫持攻击