As Large Language Models (LLMs) and generative AI become more widespread, the content safety risks associated with their use also increase. We find a notable deficiency in high-quality content safety datasets and benchmarks that comprehensively cover a wide range of critical safety areas. To address this, we define a broad content safety risk taxonomy comprising 13 critical risk and 9 sparse risk categories. Additionally, we curate AEGISSAFETYDATASET, a new dataset of approximately 26,000 human-LLM interaction instances, complete with human annotations adhering to the taxonomy. We plan to release this dataset to the community to further research and to help benchmark LLMs for safety. To demonstrate the effectiveness of the dataset, we instruction-tune multiple LLM-based safety models. We show that our models (named AEGISSAFETYEXPERTS) not only surpass or perform competitively with the state-of-the-art LLM-based safety models and general-purpose LLMs, but also exhibit robustness across multiple jailbreak attack categories. We also show that using AEGISSAFETYDATASET during the LLM alignment phase does not negatively impact the performance of the aligned models on MT Bench scores. Furthermore, we propose AEGIS, a novel application of a no-regret online adaptation framework with strong theoretical guarantees, to perform content moderation with an ensemble of LLM content safety experts in deployment.
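The abstract does not specify which no-regret algorithm AEGIS uses. As an illustration only, the following minimal sketch applies the classic Hedge (exponential weights) update, a standard no-regret method, to an ensemble of safety experts that each cast a binary unsafe/safe vote. The function name `hedge_moderate`, the 0/1 loss, and the learning rate `eta` are assumptions made for this example, not details taken from the paper.

```python
import math


def hedge_moderate(expert_votes, true_labels, eta=0.5):
    """Illustrative Hedge update over binary safety experts (hypothetical helper).

    expert_votes: list of rounds; each round is a list of 0/1 votes
                  (1 = unsafe), one vote per expert.
    true_labels:  list of 0/1 feedback labels, one per round.
    Returns the ensemble decisions and the final normalized expert weights.
    """
    n_experts = len(expert_votes[0])
    weights = [1.0] * n_experts
    decisions = []
    for votes, label in zip(expert_votes, true_labels):
        total = sum(weights)
        # Weighted fraction of the ensemble that votes "unsafe".
        p_unsafe = sum(w * v for w, v in zip(weights, votes)) / total
        decisions.append(1 if p_unsafe >= 0.5 else 0)
        # Multiplicatively down-weight every expert whose vote was wrong
        # (0/1 loss, assumed here for simplicity).
        weights = [
            w * math.exp(-eta * (1.0 if v != label else 0.0))
            for w, v in zip(weights, votes)
        ]
    norm = sum(weights)
    return decisions, [w / norm for w in weights]


# Toy run: expert 0 is always right, experts 1 and 2 are always wrong.
labels = [1, 0, 1, 0]
votes = [[y, 1 - y, 1 - y] for y in labels]
decisions, final_weights = hedge_moderate(votes, labels)
```

In this toy run the ensemble initially follows the incorrect majority, then shifts toward the reliable expert as the other two accumulate losses, which is the adaptive behavior a no-regret scheme is meant to provide.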

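In the context of generative AI models, we define a broad content safety risk taxonomy and create a new dataset, AEGISSAFETYDATASET, for studying and evaluating the safety of large language models. Experiments show that our proposed models, AEGISSAFETYEXPERTS, not only perform well across multiple safety risk categories but also remain robust under multiple attack types. In addition, we propose AEGIS, a method that uses an ensemble of LLM content safety experts to perform content safety moderation.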

AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts