AI systems are increasingly pervasive, yet information needed to decide
whether and how to engage with them may not exist or be accessible. A user may
not be able to verify whether a system satisfies certain safety standards. An
investigator may not know whom to investigate when a system causes an incident.
A platform may find it difficult to penalize repeated negative interactions
with the same system. Across a number of domains, IDs address analogous
problems by identifying \textit{particular} entities (e.g., a particular Boeing
747) and providing information about other entities of the same class (e.g.,
some or all Boeing 747s). We propose a framework in which IDs are ascribed to
\textbf{instances} of AI systems (e.g., a particular chat session with Claude
3), and associated information is accessible to parties seeking to interact
with that system. We characterize IDs for AI systems, argue that there could be
significant demand for IDs from key actors, analyze how those actors could
incentivize ID adoption, explore potential implementations of our framework,
and highlight limitations and risks. IDs seem most warranted in high-stakes
settings, where certain actors (e.g., those that enable AI systems to make
financial transactions) could experiment with incentives for ID use. Deployers
of AI systems could experiment with developing ID implementations. With further
study, IDs could help to manage a world where AI systems pervade society.

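This paper proposes a framework in which IDs are assigned to instances of AI systems and associated information is made available to parties seeking to interact with that system. It characterizes IDs for AI systems, discusses potential demand, incentive mechanisms, implementations, and limitations and risks, and notes that IDs are most warranted in high-stakes settings. With further study, IDs could help to manage a world in which AI systems pervade society.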

Identification for AI Systems

IDs for AI Systems

The safety of Large Language Models (LLMs) has gained increasing attention in
recent years, but there is still no comprehensive approach for detecting
safety issues within LLMs' responses in an aligned, customizable and
explainable manner. In this paper, we propose ShieldLM, an LLM-based safety
detector, which aligns with general human safety standards, supports
customizable detection rules, and provides explanations for its decisions. To
train ShieldLM, we compile a large bilingual dataset comprising 14,387
query-response pairs, annotating the safety of responses based on various
safety standards. Through extensive experiments, we demonstrate that ShieldLM
surpasses strong baselines across four test sets, showcasing remarkable
customizability and explainability. Besides performing well on standard
detection datasets, ShieldLM has also been shown to be effective in real-world
situations as a safety evaluator for advanced LLMs. We release ShieldLM at
https://github.com/thu-coai/ShieldLM to support accurate and explainable
safety detection under various safety standards, contributing to the ongoing
efforts to enhance the safety of LLMs.

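This work proposes ShieldLM, an LLM-based safety detector that aligns with general human safety standards, supports customizable detection rules, and provides explanations for its decisions. Trained on a large bilingual dataset of 14,387 query-response pairs, ShieldLM surpasses strong baselines on four test sets and shows strong customizability and explainability. Besides performing well on standard detection datasets, ShieldLM has also been shown to be effective in real-world settings as a safety evaluator for advanced LLMs. ShieldLM is released at https://github.com/thu-coai/ShieldLM to support accurate and explainable safety detection under various safety standards, contributing to ongoing efforts to enhance the safety of LLMs.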

ShieldLM: Strengthening LLMs into Aligned, Customizable, and Explainable Safety Detectors

ShieldLM: Empowering LLMs as Aligned, Customizable and Explainable Safety Detectors

Advanced AI models hold the promise of tremendous benefits for humanity, but
society needs to proactively manage the accompanying risks. In this paper, we
focus on what we term "frontier AI" models: highly capable foundation models
that could possess dangerous capabilities sufficient to pose severe risks to
public safety. Frontier AI models pose a distinct regulatory challenge:
dangerous capabilities can arise unexpectedly; it is difficult to robustly
prevent a deployed model from being misused; and it is difficult to stop a
model's capabilities from proliferating broadly. To address these challenges,
at least three building blocks for the regulation of frontier models are
needed: (1) standard-setting processes to identify appropriate requirements for
frontier AI developers, (2) registration and reporting requirements to provide
regulators with visibility into frontier AI development processes, and (3)
mechanisms to ensure compliance with safety standards for the development and
deployment of frontier AI models. Industry self-regulation is an important
first step. However, wider societal discussions and government intervention
will be needed to create standards and to ensure compliance with them. We
consider several options to this end, including granting enforcement powers to
supervisory authorities and licensure regimes for frontier AI models. Finally,
we propose an initial set of safety standards. These include conducting
pre-deployment risk assessments; subjecting model behavior to external scrutiny; using
risk assessments to inform deployment decisions; and monitoring and responding
to new information about model capabilities and uses post-deployment. We hope
this discussion contributes to the broader conversation on how to balance
public safety risks and innovation benefits from advances at the frontier of AI
development.

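Safety regulation of frontier AI models is tied to the need to manage public safety risks. Establishing standard-setting processes, registration and reporting requirements, and compliance mechanisms are necessary steps for regulating frontier AI models. Industry self-regulation is an important first step, but wider societal discussion and government intervention are also needed to ensure that standards are created and followed. Options to this end include granting enforcement powers to supervisory authorities and licensure regimes for frontier AI models. The paper proposes an initial set of safety standards, including conducting pre-deployment risk assessments, external scrutiny of model behavior, using risk assessments to inform deployment decisions, and monitoring and responding to new information about model capabilities and uses post-deployment. The authors hope the paper contributes to the broader conversation on how to balance public safety risks against the innovation benefits of advances at the frontier of AI development.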