Today's LLMs are susceptible to prompt injections, jailbreaks, and other
attacks that allow adversaries to overwrite a model's original instructions
with their own malicious prompts. In this work, we argue that one of the
primary vulnerabilities underlying these attacks is that LLMs often consider
system prompts (e.g., text from an application developer) to be the same
priority as text from untrusted users and third parties. To address this, we
propose an instruction hierarchy that explicitly defines how models should
behave when instructions of different priorities conflict. We then propose a
data generation method to demonstrate this hierarchical instruction following
behavior, which teaches LLMs to selectively ignore lower-privileged
instructions. We apply this method to GPT-3.5, showing that it drastically
increases robustness -- even for attack types not seen during training -- while
imposing minimal degradations on standard capabilities.

今天的 LLMs 容易受到即时注入、越狱和其他攻击的影响，使得恶意提示可以覆盖模型的初始指令。本文提出一种指令层次结构，明确定义了在不同优先级指令冲突时模型应该如何行为，并提出了一种数据生成方法来展示这种层次指令遵循行为，教导 LLMs 有选择性地忽略低权限指令。我们将这种方法应用于 GPT-3.5 上，展示它显著增加了鲁棒性，甚至对训练期间未见的攻击类型，同时对标准能力的降低影响很小。

指令层次结构：训练 LLMs 优先处理特权指令

The Instruction Hierarchy: Training LLMs to Prioritize Privileged  Instructions

Large language models and AI chatbots have been at the forefront of
democratizing artificial intelligence. However, the releases of ChatGPT and
other similar tools have been followed by growing concerns regarding the
difficulty of controlling large language models and their outputs. Currently,
we are witnessing a cat-and-mouse game where users attempt to misuse the models
with a novel attack called prompt injections. In contrast, the developers
attempt to discover the vulnerabilities and block the attacks simultaneously.
In this paper, we provide an overview of these emergent threats and present a
categorization of prompt injections, which can guide future research on prompt
injections and act as a checklist of vulnerabilities in the development of LLM
interfaces. Moreover, based on previous literature and our own empirical
research, we discuss the implications of prompt injections to LLM end users,
developers, and researchers.

大语言模型和 AI 聊天机器人在使人工智能民主化方面处于前沿。然而，发布 ChatGPT 和其他类似工具后，人们越来越担心难以控制大语言模型及其输出的问题。目前，我们正目睹用户试图滥用这些模型而开展的一场猫鼠大战，新出现了一种名为提示注入的攻击方式。相反，开发人员试图同时发现这些漏洞并阻止攻击。在本文中，我们概述了这些新出现的威胁，并提供提示注入的分类，以指导未来有关提示注入的研究，并作为在 LLM 接口开发中漏洞检查清单。此外，基于先前的文献和我们自己的实证研究，我们还讨论了提示注入对 LLM 终端用户、开发人员和研究人员的影响。