In recent years, Large Language Models (LLM) have emerged as pivotal tools in various applications. However, these models are susceptible to adversarial prompt attacks, where attackers can carefully curate input strings that lead to undesirable outputs. The inherent vulnerability of LLMs stems from their input-output mechanisms, especially when presented with intensely out-of-distribution (OOD) inputs. This paper proposes a token-level detection method to identify adversarial prompts, leveraging the LLM's capability to predict the next token's probability. We measure the degree of the model's perplexity and incorporate neighboring token information to encourage the detection of contiguous adversarial prompt sequences. As a result, we propose two methods: one that identifies each token as either being part of an adversarial prompt or not, and another that estimates the probability of each token being part of an adversarial prompt.

本文提出了一种基于令牌级别检测方法来识别对抗提示的方法，利用大型语言模型的能力来预测下一个令牌的概率，测量模型的困惑度并结合相邻令牌信息，以鼓励检测连续的对抗提示序列，提出了两种方法：一种将每个令牌识别为是否属于对抗提示的一部分，另一种估计每个令牌属于对抗提示的概率。

基于困惑度度量和上下文信息的标记级对抗性提示检测