Retrieval-augmented generation (RAG) is a powerful technique for augmenting language models with proprietary and private data, a setting in which data privacy is a pivotal concern. Although extensive research has demonstrated the privacy risks of large language models (LLMs), RAG could reshape the inherent behavior of LLM generation and thereby pose new privacy issues that are currently under-explored. In this work, we conduct extensive empirical studies with novel attack methods, which demonstrate the vulnerability of RAG systems to leaking the private retrieval database. Despite this new risk that RAG introduces for the retrieval data, we further reveal that RAG can mitigate the leakage of the LLMs' training data. Overall, this paper provides new insights into privacy protection for retrieval-augmented LLMs, which benefit builders of both LLMs and RAG systems. Our code is available at https://github.com/phycholosogy/RAG-privacy.

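Retrieval-augmented generation (RAG) can augment language models with proprietary and private data, a setting in which data privacy is a key concern. This study conducts extensive empirical research on RAG systems and proposes new attack methods that expose their vulnerability to leaking the private retrieval database. Although RAG introduces this new risk, it can mitigate leakage of the language model's training data. These findings offer new insights into privacy protection for retrieval-augmented language models and benefit builders of both language models and RAG systems.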

Exploring Privacy Issues in Retrieval-Augmented Generation (RAG)