As Large Language Models (LLMs) and generative AI become more widespread, the
content safety risks associated with their use also increase. We find a notable
deficiency in high-quality content safety datasets and benchmarks that
comprehensively cover a wide range of critical safety areas. To address this,
we define a broad content safety risk taxonomy, comprising 13 critical risk and
9 sparse risk categories. Additionally, we curate AEGISSAFETYDATASET, a new
dataset of approximately 26,000 human-LLM interaction instances, complete with
human annotations adhering to the taxonomy. We plan to release this dataset to
the community to further research and to help benchmark LLMs for safety.
To demonstrate the effectiveness of the dataset, we instruction-tune multiple
LLM-based safety models. We show that our models (named AEGISSAFETYEXPERTS)
not only surpass or perform competitively with state-of-the-art LLM-based
safety models and general-purpose LLMs, but also exhibit robustness across
multiple jailbreak attack categories. We also show that using
AEGISSAFETYDATASET during the LLM alignment phase does not negatively impact
the performance of the aligned models on MT Bench. Furthermore, we
propose AEGIS, a novel application of a no-regret online adaptation framework
with strong theoretical guarantees, to perform content moderation with an
ensemble of LLM content safety experts in deployment.
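
To illustrate what a no-regret online ensemble of safety experts can look like, here is a minimal sketch using the classic Hedge (multiplicative weights) update. This is an assumption chosen for illustration, not necessarily the algorithm the paper uses; the function name `hedge_moderate`, the learning rate `eta`, and the absolute-error loss are all hypothetical.

```python
import math

def hedge_moderate(expert_votes, labels, eta=0.5):
    """Illustrative Hedge update over safety experts (hypothetical helper).

    expert_votes: list of rounds; each round holds one "unsafe" score
                  in [0, 1] per expert.
    labels:       ground-truth label per round (1 = unsafe, 0 = safe),
                  revealed only after the ensemble has predicted.
    Returns (predictions, final_weights).
    """
    n = len(expert_votes[0])
    weights = [1.0 / n] * n          # start from a uniform mixture
    predictions = []
    for votes, y in zip(expert_votes, labels):
        # Ensemble decision: weighted average of the expert scores.
        p_unsafe = sum(w * v for w, v in zip(weights, votes))
        predictions.append(1 if p_unsafe >= 0.5 else 0)
        # Each expert suffers an absolute-error loss once the label is known.
        losses = [abs(v - y) for v in votes]
        # Multiplicative update: experts with higher loss lose weight.
        weights = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return predictions, weights
```

In this sketch, an expert that keeps agreeing with the revealed labels accumulates weight, so the ensemble's decisions drift toward the experts that perform best on the traffic actually seen in deployment.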

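For generative AI models, we define a broad content safety risk taxonomy and create a new dataset, AEGISSAFETYDATASET, for studying and evaluating the safety of large language models. Experiments show that our proposed models, AEGISSAFETYEXPERTS, not only perform well across multiple safety risk categories but also remain robust under multiple attack types. In addition, we propose AEGIS, a method that uses an ensemble of LLM content safety experts to moderate content.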

AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts

Vertical federated learning (VFL) leverages various privacy-preserving
algorithms, e.g., homomorphic encryption or secret sharing based SecureBoost,
to ensure data privacy. However, these algorithms all assume a semi-honest
security model, which raises concerns in real-world applications. In this
paper, we present Aegis, a trusted, automatic, and accurate verification
framework to verify the security of VFL jobs. Aegis is separated from local
parties to ensure the security of the framework. Furthermore, it automatically
adapts to evolving VFL algorithms by modeling the VFL job as a finite state
machine, which lets it verify different algorithms uniformly, and it reproduces
the entire job to provide more accurate verification. We implement and evaluate Aegis with
different threat models on financial and medical datasets. Evaluation results
show that: 1) Aegis can detect 95% of threat models, and 2) it provides
fine-grained verification results within 84% of the total VFL job time.
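
The idea of treating a VFL job as a finite state machine can be sketched as follows. This is only an illustration under stated assumptions: the state names, event names, and the `verify_job` helper are hypothetical and are not taken from the Aegis implementation.

```python
# Hypothetical transition table: (current state, observed event) -> next state.
TRANSITIONS = {
    ("INIT", "exchange_keys"): "KEYS_READY",
    ("KEYS_READY", "send_encrypted_gradients"): "GRADIENTS_SENT",
    ("GRADIENTS_SENT", "aggregate"): "AGGREGATED",
    ("AGGREGATED", "send_encrypted_gradients"): "GRADIENTS_SENT",  # next round
    ("AGGREGATED", "finish"): "DONE",
}

def verify_job(events, start="INIT", accept="DONE"):
    """Replay a recorded event trace against the state machine.

    Returns (ok, bad_index): ok is True only if every event is a legal
    transition and the trace ends in the accepting state; bad_index is the
    position of the first illegal event, or None if there was none.
    """
    state = start
    for i, event in enumerate(events):
        next_state = TRANSITIONS.get((state, event))
        if next_state is None:
            return False, i          # event not allowed in this state
        state = next_state
    return state == accept, None
```

Because the checker only depends on the transition table, supporting a different VFL algorithm in this sketch would mean supplying a different table rather than writing a new verifier.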

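This work presents Aegis, a trusted, automatic, and accurate verification framework for verifying the security of vertical federated learning jobs. It detects 95% of threat models and provides fine-grained verification results within 84% of the total VFL job time.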