The safety of Large Language Models (LLMs) has gained increasing attention in
recent years, but a comprehensive approach for detecting safety issues in
LLMs' responses in an aligned, customizable, and explainable manner is still
lacking. In this paper, we propose ShieldLM, an LLM-based safety
detector, which aligns with general human safety standards, supports
customizable detection rules, and provides explanations for its decisions. To
train ShieldLM, we compile a large bilingual dataset comprising 14,387
query-response pairs, annotating the safety of responses based on various
safety standards. Through extensive experiments, we demonstrate that ShieldLM
surpasses strong baselines across four test sets, showcasing remarkable
customizability and explainability. Besides performing well on standard
detection datasets, ShieldLM has also been shown to be effective in real-world
situations as a safety evaluator for advanced LLMs. We release ShieldLM at
https://github.com/thu-coai/ShieldLM to support accurate and explainable
safety detection under various safety standards, contributing to the ongoing
efforts to enhance the safety of LLMs.

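This study proposes ShieldLM, a safety detector based on large language models that follows general human safety standards, supports customizable detection rules, and provides explanations for its decisions. Trained on a large bilingual dataset of 14,387 query-response pairs, ShieldLM surpasses strong baselines on four test sets and demonstrates strong customizability and explainability. Beyond performing well on standard detection datasets, ShieldLM has also proven effective in practical use as a safety evaluator for advanced language models. ShieldLM, released at https://github.com/thu-coai/ShieldLM, supports accurate and explainable safety detection under various safety standards and contributes to ongoing efforts to improve the safety of large language models.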