Large Language Models (LLMs) are susceptible to Jailbreaking attacks, which
aim to extract harmful information by subtly modifying the attack query. As
defense mechanisms evolve, directly obtaining harmful information becomes
increasingly challenging for Jailbreaking attacks. In this work, inspired by
human practices of indirect context to elicit harmful information, we focus on
a new attack form called Contextual Interaction Attack. The idea relies on the
autoregressive nature of the generation process in LLMs. We contend that the
prior context--the information preceding the attack query--plays a pivotal role
in enabling potent Jailbreaking attacks. Specifically, we propose an approach
that leverages preliminary question-answer pairs to interact with the LLM. By
doing so, we guide the responses of the model toward revealing the 'desired'
harmful information. We conduct experiments on four different LLMs and
demonstrate the efficacy of this attack, which is black-box and can also
transfer across LLMs. We believe this can lead to further developments and
understanding of the context vector in LLMs.

大型语言模型对越狱攻击很容易受到攻击，本研究提出了一种基于上下文互动的攻击形式，通过操作模型的回应引导其透露有害信息。在四个不同的大型语言模型上进行实验证明了该攻击的有效性，并且该攻击可以在不同大型语言模型之间转移。