Large Language Models (LLMs), like ChatGPT, have demonstrated vast potential but also introduce challenges related to content constraints and potential misuse. Our study investigates three key research questions: (1) the number of different prompt types that can jailbreak LLMs, (2) the effectiveness of jailbreak prompts in circumventing LLM constraints, and (3) the resilience of ChatGPT against these jailbreak prompts. Initially, we develop a classification model to analyze the distribution of existing prompts, identifying ten distinct patterns and three categories of jailbreak prompts. Subsequently, we assess the jailbreak capability of prompts with ChatGPT versions 3.5 and 4.0, utilizing a dataset of 3,120 jailbreak questions across eight prohibited scenarios. Finally, we evaluate the resistance of ChatGPT against jailbreak prompts, finding that the prompts can consistently evade the restrictions in 40 use-case scenarios. The study underscores the importance of prompt structures in jailbreaking LLMs and discusses the challenges of robust jailbreak prompt generation and prevention.

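This study examines the challenges of content restrictions and potential misuse in Large Language Models (LLMs) and investigates three key questions related to jailbreaking LLMs: the number of distinct prompt types, the effectiveness of prompts at circumventing LLM restrictions, and the robustness of ChatGPT against these prompts. Using a classification model to analyze the distribution of existing prompts, the study identifies ten distinct patterns and three categories of jailbreak prompts. It then uses a dataset of 3,120 questions to evaluate the jailbreak capability of prompts on ChatGPT versions 3.5 and 4.0, and finds that the prompts can consistently evade restrictions in 40 use-case scenarios. The study highlights the importance of prompt structure in jailbreaking LLMs and discusses the challenges of generating and preventing robust jailbreak prompts.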

Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study