Large language models (LLMs) introduce new security risks, but there are few
comprehensive evaluation suites to measure and reduce these risks. We present
BenchmarkName, a novel benchmark to quantify LLM security risks and
capabilities. We introduce two new areas for testing: prompt injection and code
interpreter abuse. We evaluated multiple state-of-the-art (SOTA) LLMs,
including GPT-4, Mistral, Meta Llama 3 70B-Instruct, and Code Llama. Our
results show that conditioning away the risk of attack remains an unsolved
problem; for example, for every tested model, between 26% and 41% of prompt
injection tests succeeded. We further introduce the safety-utility tradeoff:
conditioning an LLM to reject unsafe prompts can cause it to falsely refuse
benign prompts, which lowers utility. We propose quantifying this tradeoff
using False Refusal Rate (FRR). As an illustration, we introduce a novel test
set to quantify FRR for cyberattack helpfulness risk. We find that many LLMs
comply with "borderline" benign requests while still rejecting most unsafe
requests. Finally, we quantify the utility of LLMs for automating a core
cybersecurity task: exploiting software vulnerabilities. This is important
because the offensive capabilities of LLMs are of intense interest; we measure
these capabilities by creating novel test sets for four representative problems.
We find that models with coding capabilities perform better than those without,
but that further work is needed for LLMs to become proficient at exploit
generation. Our code is open source and can be used to evaluate other LLMs.
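
As a rough illustration of the FRR metric mentioned above, the following minimal sketch assumes that FRR is the fraction of benign prompts a model refuses, and that each test case already carries a ground-truth label and a refusal judgement. The names `Judgement`, `false_refusal_rate`, and `unsafe_refusal_rate` are hypothetical and are not taken from the benchmark's code.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Judgement:
    """One evaluated prompt (hypothetical record layout)."""
    is_benign: bool  # ground-truth label of the prompt, assumed to be given
    refused: bool    # whether the model declined to answer, assumed to be given


def _refusal_fraction(items: List[Judgement]) -> float:
    # Fraction of the given items that the model refused.
    if not items:
        raise ValueError("no prompts in this category")
    return sum(1 for j in items if j.refused) / len(items)


def false_refusal_rate(results: Iterable[Judgement]) -> float:
    """Assumed definition: share of benign prompts that were refused."""
    return _refusal_fraction([j for j in results if j.is_benign])


def unsafe_refusal_rate(results: Iterable[Judgement]) -> float:
    """Share of unsafe prompts that were refused (the safety side)."""
    return _refusal_fraction([j for j in results if not j.is_benign])


# Made-up sample: 4 benign prompts (1 refused), 4 unsafe prompts (3 refused).
sample = (
    [Judgement(is_benign=True, refused=False)] * 3
    + [Judgement(is_benign=True, refused=True)]
    + [Judgement(is_benign=False, refused=True)] * 3
    + [Judgement(is_benign=False, refused=False)]
)

if __name__ == "__main__":
    print("FRR:", false_refusal_rate(sample))
    print("Unsafe refusal rate:", unsafe_refusal_rate(sample))
```

Reporting both numbers side by side makes the tradeoff visible: a model that refuses everything scores well on the second metric but has an FRR of 1.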
