Large Language Models (LLMs) enable a new ecosystem with many downstream
applications, called LLM applications, with different natural language
processing tasks. The functionality and performance of an LLM application
highly depend on its system prompt, which instructs the backend LLM on what
task to perform. Therefore, an LLM application developer often keeps a system
prompt confidential to protect its intellectual property. As a result, a
natural attack, called prompt leaking, is to steal the system prompt from an
LLM application, which compromises the developer's intellectual property.
Existing prompt leaking attacks primarily rely on manually crafted queries, and
thus achieve limited effectiveness.
In this paper, we design a novel, closed-box prompt leaking attack framework,
called PLeak, to optimize an adversarial query such that when the attacker
sends it to a target LLM application, its response reveals its own system
prompt. We formulate finding such an adversarial query as an optimization
problem and solve it with a gradient-based method approximately. Our key idea
is to break down the optimization goal by optimizing adversary queries for
system prompts incrementally, i.e., starting from the first few tokens of each
system prompt step by step until the entire length of the system prompt.
We evaluate PLeak in both offline settings and for real-world LLM
applications, e.g., those on Poe, a popular platform hosting such applications.
Our results show that PLeak can effectively leak system prompts and
significantly outperforms not only baselines that manually curate queries but
also baselines with optimized queries that are modified and adapted from
existing jailbreaking attacks. We responsibly reported the issues to Poe and
are still waiting for their response. Our implementation is available at this
repository: this https URL

设计了一种新颖的闭盒信息泄露攻击框架 PLeak，用于优化对抗查询，以便当攻击者将其发送到目标 LLM 应用程序时，其响应会泄露自己的系统提示。通过逐步优化系统提示的每个令牌的对抗性查询，有效地泄露系统提示，并显著优于手动策划查询和修改自现有越狱攻击的优化查询。

PLeak：大规模语言模型应用中的提示泄露攻击

PLeak: Prompt Leaking Attacks against Large Language Model Applications

Transformer-based large language models (LLMs) provide a powerful foundation
for natural language tasks in large-scale customer-facing applications.
However, studies that explore their vulnerabilities emerging from malicious
user interaction are scarce. By proposing PromptInject, a prosaic alignment
framework for mask-based iterative adversarial prompt composition, we examine
how GPT-3, the most widely deployed language model in production, can be easily
misaligned by simple handcrafted inputs. In particular, we investigate two
types of attacks -- goal hijacking and prompt leaking -- and demonstrate that
even low-aptitude, but sufficiently ill-intentioned agents, can easily exploit
GPT-3's stochastic nature, creating long-tail risks. The code for PromptInject
is available at this https URL.

使用 PromptInject 对 GPT-3 进行了安全性评估，发现针对 goal hijacking 和 prompt leaking 的手工输入攻击可以利用 GPT-3 的随机性，导致潜在的风险