As Large Language Models (LLMs) increasingly become key components in various
AI applications, understanding their security vulnerabilities and the
effectiveness of defense mechanisms is crucial. This survey examines the
security challenges of LLMs, focusing on two main areas: Prompt Hacking and
Adversarial Attacks, each with specific types of threats. Under Prompt Hacking,
we explore Prompt Injection and Jailbreaking Attacks, discussing how they work,
their potential impacts, and ways to mitigate them. Similarly, we analyze
Adversarial Attacks, breaking them down into Data Poisoning Attacks and
Backdoor Attacks. This structured examination helps us understand the
relationships between these vulnerabilities and the defense strategies that can
be implemented. The survey highlights these security challenges and discusses
robust defensive frameworks to protect LLMs against these threats. By detailing
these security issues, the survey contributes to the broader discussion on
creating resilient AI systems that can resist sophisticated attacks.

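Large language models are key components in a wide range of AI applications, so understanding their security vulnerabilities and the effectiveness of defense mechanisms is crucial. This survey examines the security challenges of LLMs, focusing on two main areas, Prompt Hacking and Adversarial Attacks, each with its own specific types of threats. Through an analysis of Prompt Hacking and Adversarial Attacks, it studies how they work, their potential impacts, and ways to mitigate them. The survey highlights these security challenges and discusses robust defensive frameworks for protecting LLMs against these threats. By detailing these security issues, it contributes to the discussion on building resilient AI systems that can withstand sophisticated attacks.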

Exploring Vulnerabilities and Protections in Large Language Models: A Survey

Large language models (LLMs) like ChatGPT have reached the 100 million user mark
in record time and might increasingly enter all areas of our life, leading to a
diverse set of interactions between those Artificial Intelligence models and
humans. While many studies have discussed governance and regulations
deductively from first-order principles, few studies provide an inductive,
data-driven lens based on observing dialogues between humans and LLMs,
especially when it comes to non-collaborative, competitive situations that have
the potential to pose a serious threat to people. In this work, we conduct a
user study engaging over 40 individuals across all age groups in price
negotiations with an LLM. We explore how people interact with an LLM,
investigating differences in negotiation outcomes and strategies. Furthermore,
we highlight shortcomings of LLMs with respect to their reasoning capabilities
and, in turn, susceptibility to prompt hacking, which intends to manipulate the
LLM to make agreements that are against its instructions or beyond any
rationality. We also show that the negotiated prices humans manage to achieve
span a broad range, which points to a literacy gap in effectively interacting
with LLMs.

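By observing dialogues between humans and large language models (LLMs), this study takes a data-driven approach to an inductive analysis of LLM governance and regulation. It explores the serious threat that non-collaborative, competitive human-LLM interactions may pose to people, as well as the shortcomings in LLM reasoning capabilities and the resulting susceptibility to manipulation. The study also shows that the prices people reach when negotiating with LLMs span a broad range, which points to a literacy gap in interacting effectively with LLMs.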

Negotiating with LLMs: Prompt Hacks, Skill Gaps, and Reasoning Deficits

Large Language Models (LLMs) are increasingly being deployed in interactive
contexts that involve direct user engagement, such as chatbots and writing
assistants. These deployments are increasingly plagued by prompt injection and
jailbreaking (collectively, prompt hacking), in which models are manipulated to
ignore their original instructions and instead follow potentially malicious
ones. Although widely acknowledged as a significant security threat, there is a
dearth of large-scale resources and quantitative studies on prompt hacking. To
address this lacuna, we launch a global prompt hacking competition, which
allows for free-form human input attacks. We elicit 600K+ adversarial prompts
against three state-of-the-art LLMs. We describe the dataset, which empirically
verifies that current LLMs can indeed be manipulated via prompt hacking. We
also present a comprehensive taxonomical ontology of the types of adversarial
prompts.

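Through a global prompt hacking competition, we show that current large language models can be manipulated via prompt hacking, provide a dataset of 600K+ adversarial prompts against three state-of-the-art LLMs, and present a comprehensive taxonomical ontology of the types of adversarial prompts.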