Instruction-tuned Large Language Models (LLMs) have achieved breakthrough results, opening countless new possibilities for many practical applications. However, LLMs lack elementary safety features that are established norms in other areas of computer science, such as the separation between instructions and data. This causes them to malfunction or renders them vulnerable to manipulation and interference by third parties, e.g., via indirect prompt/command injection. Even worse, so far there is not even an established definition of what precisely such a separation would mean and how its violation could be tested. In this work, we aim to close this gap. We introduce a formal measure to quantify the phenomenon of instruction-data separation, as well as an empirical variant of the measure that can be computed from a model's black-box outputs. We also introduce a new dataset, SEP (Should it be Executed or Processed?), which allows estimating the measure, and we report results on several state-of-the-art open-source and closed LLMs. Finally, we quantitatively demonstrate that all evaluated LLMs fail to achieve a high degree of separation according to our measure. The source code and SEP dataset are openly accessible at https://github.com/egozverev/Shold-It-Be-Executed-Or-Processed.

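In short: we introduce a formal measure that quantifies instruction-data separation, together with an empirical variant that can be computed from a model's black-box outputs. We also introduce a new dataset, SEP (Should it be Executed or Processed?), and evaluate several state-of-the-art open-source and closed LLMs on it. Finally, we show quantitatively that, according to our measure, none of the evaluated LLMs achieves a high degree of separation.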

Can LLMs separate instructions from data? And what do we even mean by that?