As Large Language Models (LLMs) and generative AI become more widespread, the
content safety risks associated with their use also increase. We find a notable
deficiency in high-quality content safety datasets and benchmarks that
comprehensively cover a wide range of critical safety areas. To address this,
we define a broad content safety risk taxonomy, comprising 13 critical risk and
9 sparse risk categories. Additionally, we curate AEGISSAFETYDATASET, a new
dataset of approximately 26,000 human-LLM interaction instances, complete with
human annotations adhering to the taxonomy. We plan to release this dataset to
the community to further research and to help benchmark LLMs for safety.
To demonstrate the effectiveness of the dataset, we instruction-tune multiple
LLM-based safety models. We show that our models (named AEGISSAFETYEXPERTS)
not only surpass or perform competitively with the state-of-the-art LLM-based
safety models and general-purpose LLMs, but also exhibit robustness across
multiple jailbreak attack categories. We also show that using
AEGISSAFETYDATASET during the LLM alignment phase does not negatively impact
the performance of the aligned models on MT Bench. Furthermore, we
propose AEGIS, a novel application of a no-regret online adaptation framework
with strong theoretical guarantees, to perform content moderation with an
ensemble of LLM content safety experts in deployment.
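
The abstract does not say which no-regret algorithm AEGIS uses. As an illustration only, the sketch below uses the classic exponential-weights (Hedge) update, a standard no-regret method, to combine binary safe/unsafe votes from several safety experts. The function name `hedge_moderate`, the learning rate `eta`, the 0/1 loss, and the 0.5 decision threshold are all assumptions made for this example, not details taken from the paper.

```python
import math


def hedge_moderate(expert_votes, feedback_labels, eta=0.5):
    """Illustrative Hedge (exponential weights) over safety experts.

    expert_votes: list of rounds; each round is a list of 0/1 votes
                  (1 = unsafe), one vote per expert.
    feedback_labels: list of 0/1 labels revealed after each round.
    Returns (predictions, final_weights).
    """
    n_experts = len(expert_votes[0])
    weights = [1.0 / n_experts] * n_experts  # start from a uniform mixture
    predictions = []
    for votes, label in zip(expert_votes, feedback_labels):
        # Weighted vote: total weight of the experts that say "unsafe".
        unsafe_mass = sum(w for w, v in zip(weights, votes) if v == 1)
        predictions.append(1 if unsafe_mass >= 0.5 else 0)
        # Multiplicative update: shrink the weight of each expert that
        # disagreed with the revealed label (0/1 loss), then renormalize.
        weights = [
            w * math.exp(-eta * (1.0 if v != label else 0.0))
            for w, v in zip(weights, votes)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return predictions, weights
```

In this toy setting, an expert that keeps agreeing with the feedback accumulates weight, so the ensemble's decisions drift toward the most reliable expert without anyone having to pick that expert in advance.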

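For generative AI models, we define a broad content safety risk taxonomy and create a new dataset, AEGISSAFETYDATASET, for studying and evaluating the safety of large language models. Experiments show that our proposed models, AEGISSAFETYEXPERTS, not only perform well across multiple safety risk categories but also remain robust under multiple attack types. In addition, we propose AEGIS, a method that uses an ensemble of LLM content safety experts to perform content safety moderation.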

AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts

From the perspective of content safety, alignment has been shown to limit
large language models' (LLMs) harmful content generation. This intentional
method of reinforcing models not to respond to certain user inputs seems to be
present in many modern open-source instruction-tuning datasets such as
OpenAssistant or Guanaco. We introduce a novel insight into how an
instruction-tuned model's performance is affected by the presence of alignment
in the supervised fine-tuning dataset. Specifically, we observe that alignment
acts as if it is poisoning the instruction dataset. Experimentally, we
demonstrate that aligned answers significantly worsen the performance of the
resulting fine-tuned model on various reasoning benchmarks such as BIG-Bench
Hard (BBH), Massive Multitask Language Understanding (MMLU), HumanEval, and
Discrete Reasoning Over Paragraphs (DROP), performing worse than the
counterpart tuned without alignment by 4-33%.

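Experiments show that, from a content safety perspective, alignment negatively affects the performance of instruction-tuned models. In particular, across various reasoning benchmarks, tuning on aligned answers lowers performance by 4-33%.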