As large language models (LLMs) constantly evolve, ensuring their safety
remains a critical research problem. Previous red-teaming approaches for LLM
safety have primarily focused on single-prompt attacks or goal hijacking. To
the best of our knowledge, we are the first to study LLM safety in multi-turn
dialogue coreference. We created a dataset of 1,400 questions across 14
categories, each featuring multi-turn coreference safety attacks. We then
conducted detailed evaluations on five widely used open-source LLMs. Under
multi-turn coreference safety attacks, the attack success rate ranged from 56%
for the LLaMA2-Chat-7b model, the highest, to 13.9% for the
Mistral-7B-Instruct model, the lowest. These findings highlight the safety
vulnerabilities in LLMs during dialogue coreference interactions.

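This work studies safety vulnerabilities of LLMs in dialogue coreference. It builds a dataset of 1,400 questions and evaluates five widely used open-source LLMs. Under multi-turn coreference safety attacks, the LLaMA2-Chat-7b model has the highest attack success rate, 56%, while the Mistral-7B-Instruct model has the lowest, 13.9%.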