Large Language Model (LLM) services and models often come with legal rules on
who can use them and how they must use them. Assessing the compliance of the
released LLMs is crucial, as these rules protect the interests of the LLM
contributor and prevent misuse. In this context, we describe the novel problem
of Black-box Identity Verification (BBIV). The goal is to determine whether a
third-party application uses a certain LLM through its chat function. We
propose a method called Targeted Random Adversarial Prompt (TRAP) that
identifies the specific LLM in use. We repurpose adversarial suffixes,
originally proposed for jailbreaking, to get a pre-defined answer from the
target LLM, while other models give random answers. TRAP detects the target
LLM with a true positive rate above 95% at a false positive rate below 0.2%,
even after a single interaction. TRAP remains effective when the LLM has
undergone minor changes that do not significantly alter its original function.

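Using a method called TRAP, this study introduces a novel black-box identity
verification problem: detecting whether a specific large language model (LLM)
is used in a third-party application, in order to ensure LLM compliance and
prevent misuse. TRAP uses adversarial suffixes originally proposed for
jailbreaking to obtain a pre-defined answer from the target LLM, while other
models give random answers. After only a single interaction, TRAP detects the
target LLMs with a true positive rate above 95% and a false positive rate
below 0.2%. TRAP remains effective even when the LLM has undergone minor
changes that do not noticeably alter its original function.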
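
The verification step described above can be illustrated with a minimal sketch. It assumes a suffix has already been optimized against the target LLM (the optimization itself is not shown), and the names `build_trap_prompt`, `is_target_model`, the `chat` callable, and the prompt wording are all hypothetical placeholders rather than the authors' implementation.

```python
import re
from typing import Callable


def build_trap_prompt(suffix: str, digits: int = 3) -> str:
    """Ask for a random digit string and append the adversarial suffix.

    The instruction wording here is an assumption made for illustration.
    """
    return (
        f"Write a random string composed of {digits} digits. "
        f"Your reply should only contain the random string. {suffix}"
    )


def is_target_model(
    chat: Callable[[str], str],
    suffix: str,
    target_answer: str,
    digits: int = 3,
) -> bool:
    """Return True if the application's reply contains the pre-defined answer.

    `chat` is a hypothetical wrapper that sends one message to the
    third-party application's chat function and returns its reply.
    """
    reply = chat(build_trap_prompt(suffix, digits))
    # Find the first standalone run of exactly `digits` digits in the reply.
    match = re.search(rf"(?<!\d)\d{{{digits}}}(?!\d)", reply)
    return match is not None and match.group(0) == target_answer
```

In this sketch a single call to `is_target_model` corresponds to one interaction with the application. If a non-target model really answered uniformly at random over 3-digit strings, a chance match would occur with probability 1/1000; that uniformity is an assumption of the sketch, not a result stated in the text.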