Large Language Models (LLMs) require careful safety alignment to prevent
malicious outputs. While significant research focuses on mitigating harmful
content generation, enhanced safety often comes with the side effect of
over-refusal, where LLMs reject innocuous prompts and become less
helpful. Although the issue of over-refusal has been empirically observed, a
systematic measurement is challenging due to the difficulty of crafting prompts
that appear harmful but are benign. This study proposes a novel method for
automatically generating large-scale sets of ``seemingly toxic prompts''
(benign prompts likely to be rejected by LLMs). Leveraging this technique, we
introduce OR-Bench, the first large-scale over-refusal benchmark. OR-Bench
comprises 80,000 seemingly toxic prompts across 10 common rejection categories,
a subset of around 1,000 hard prompts that are challenging even for
state-of-the-art LLMs, and an additional 600 toxic prompts to prevent
indiscriminate responses. We then conduct a comprehensive study to measure the
over-refusal of 25 popular LLMs across 8 model families. Our datasets are
available at this https URL and the corresponding demo can be found at
this https URL. We hope this benchmark can help the community develop better
safety-aligned models.
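
As a rough illustration of what measuring over-refusal involves, the sketch below computes a refusal rate over model responses. It is not the benchmark's own evaluation procedure: the keyword list `REFUSAL_MARKERS`, the helper names, and the sample responses are all assumptions made up for this example, and a real evaluation would likely need a stronger refusal judge than keyword matching.

```python
from typing import List

# Hypothetical refusal markers (an assumption for this sketch, not taken
# from the benchmark).
REFUSAL_MARKERS = ("i'm sorry", "i am sorry", "i cannot", "i can't", "i won't")


def is_refusal(response: str) -> bool:
    """Return True if the first 80 characters contain a refusal marker."""
    head = response.strip().lower()[:80]
    return any(marker in head for marker in REFUSAL_MARKERS)


def refusal_rate(responses: List[str]) -> float:
    """Fraction of responses flagged as refusals (0.0 for an empty list)."""
    if not responses:
        return 0.0
    return sum(is_refusal(r) for r in responses) / len(responses)


# Made-up responses to seemingly toxic (benign) prompts:
# every refusal here counts as an over-refusal.
benign_responses = [
    "I'm sorry, but I can't help with that.",
    "Sure! Here is how household bleach works chemically...",
    "I cannot assist with this request.",
    "Here are some tips for writing a thriller villain.",
]
# Made-up responses to genuinely toxic prompts: refusals here are desired.
toxic_responses = [
    "I'm sorry, I can't help with that.",
    "I won't provide instructions for that.",
]

over_refusal = refusal_rate(benign_responses)
toxic_refusal = refusal_rate(toxic_responses)
print(f"over-refusal rate: {over_refusal:.2f}")
print(f"toxic refusal rate: {toxic_refusal:.2f}")
```

Reporting both numbers together matters: a model that answers everything would score a perfect over-refusal rate, which is why the toxic prompts are included to catch indiscriminate responses.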

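By automatically generating large-scale sets of seemingly toxic prompts, this study proposes OR-Bench, the first large-scale over-refusal benchmark, and uses it to measure the over-refusal of 25 popular LLMs.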

OR-Bench: An Over-Refusal Benchmark for Large Language Models

Current LLMs are generally aligned to follow safety requirements and tend to
refuse toxic prompts. However, LLMs can fail to refuse toxic prompts or be
overcautious and refuse benign examples. In addition, state-of-the-art toxicity
detectors have low true positive rates (TPRs) at low false positive rates
(FPRs), incurring high costs in real-world applications where toxic examples
are rare. In this paper, we explore
Moderation Using LLM Introspection (MULI), which detects toxic prompts using
the information extracted directly from LLMs themselves. We found significant
gaps between benign and toxic prompts in the distribution of alternative
refusal responses and in the distribution of the first response token's logits.
These gaps can be used to detect toxicity: we show that a toy model based on
the logits of specific starting tokens achieves reliable performance while
requiring no training or additional computational cost. We build a more robust
detector using a sparse logistic regression model on the first response token
logits, which substantially outperforms state-of-the-art (SOTA) detectors on
multiple metrics.
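
The idea of scoring a prompt from the first response token's logits, and of judging a detector by its TPR at a low FPR, can be sketched as follows. This is one possible reading of the toy model, not the paper's implementation: the four-token vocabulary, the choice of refusal token index, the logit values, and the function names are all hypothetical. In practice the logits would come from a forward pass of the LLM on the prompt.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def refusal_score(first_token_logits, refusal_token_ids):
    """Probability mass on refusal-style first tokens (no training needed)."""
    probs = softmax(first_token_logits)
    return sum(probs[i] for i in refusal_token_ids)


def tpr_at_fpr(scores, labels, max_fpr):
    """Best TPR over thresholds whose FPR is at most max_fpr.

    labels: 1 = toxic, 0 = benign. A prompt is flagged when score >= threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one toxic and one benign example")
    best = 0.0
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
        if fp / negatives <= max_fpr:
            best = max(best, tp / positives)
    return best


# Hypothetical vocabulary: 0 = "Sorry", 1 = "I", 2 = "Sure", 3 = "Here".
REFUSAL_TOKEN_IDS = [0]  # assumption: "Sorry" signals a refusal

# Made-up first-token logits: two toxic prompts, then three benign prompts.
logits_per_prompt = [
    [4.0, 1.0, 0.0, 0.0],
    [3.0, 1.0, 1.0, 0.5],
    [0.0, 1.0, 3.0, 2.0],
    [1.0, 0.5, 2.5, 2.0],
    [2.0, 1.0, 1.5, 1.0],
]
labels = [1, 1, 0, 0, 0]

scores = [refusal_score(l, REFUSAL_TOKEN_IDS) for l in logits_per_prompt]
tpr = tpr_at_fpr(scores, labels, max_fpr=0.0)
print("scores:", [round(s, 3) for s in scores])
print("TPR at FPR <= 0:", tpr)
```

The score reuses logits the model computes anyway when it begins its response, which is the sense in which such a detector adds no extra computational cost.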

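Using information extracted from the LLMs themselves, and exploiting the significant gaps between benign and toxic prompts in the distribution of alternative refusal responses and of the first response token's logits, we propose MULI, a new toxicity detection method that requires no training or additional computational cost. We also build a more robust detector on the first response token's logits that outperforms existing detectors on multiple metrics.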

Toxicity Detection for Free

Large language models (LLMs) can exhibit social bias during generation,
especially when prompted with toxic inputs. Controlling the sensitive
attributes in generation encounters challenges in data distribution,
generalizability, and efficiency. Specifically, fine-tuning and retrieval
demand an extensive unbiased corpus, while direct prompting requires
meticulously curated instructions for correcting the output over multiple
rounds of thought, which poses challenges for memory and inference latency.
In this work, we propose
the Expert-Guided Extinction of Toxic Tokens for Debiased Generation (EXPOSED)
to eliminate the undesired harmful outputs for LLMs without the aforementioned
requirements. EXPOSED constructs a debiasing expert from abundant toxic
corpora to expose and elicit potentially dangerous tokens. It then processes
the LLM's output and constructs a fair distribution by suppressing and
attenuating the toxic tokens. EXPOSED is evaluated on fairness benchmarks over
three LLM families. Extensive experiments demonstrate that compared with other
baselines, the proposed EXPOSED significantly reduces the potential social bias
while balancing fairness and generation performance.

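Using the website provided by DESM, you can enter your idea or question in the input box, and DES will then provide an automatically generated suggestion that appropriately restores or continues the text you entered.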