This paper presents an argument that certain AI safety measures, rather than
mitigating existential risk, may instead exacerbate it. Under certain key
assumptions - the inevitability of AI failure, the expected correlation between
an AI system's power at the point of failure and the severity of the resulting
harm, and the tendency of safety measures to enable AI systems to become more
powerful before failing - safety efforts have negative expected utility. The
paper examines three response strategies: Optimism, Mitigation, and Holism.
Each faces challenges stemming from intrinsic features of the AI safety
landscape that we term Bottlenecking, the Perfection Barrier, and Equilibrium
Fluctuation. The surprising robustness of the argument forces a re-examination
of core assumptions around AI safety and points to several avenues for further
research.

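AI safety measures may exacerbate rather than mitigate existential risk. Under core assumptions about the inevitability of AI failure, the expected correlation between an AI system's power at the point of failure and the severity of the resulting harm, and the tendency of safety measures to make AI systems more powerful before they fail, safety efforts have negative expected utility. The paper examines three response strategies: Optimism, Mitigation, and Holism. Each faces challenges arising from intrinsic features of the AI safety landscape, namely Bottlenecking, the Perfection Barrier, and Equilibrium Fluctuation. The surprising robustness of the argument forces a re-examination of the core assumptions of AI safety and points to several directions for further research.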
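
The expected-utility claim can be made concrete with a minimal sketch. The notation below is an illustrative assumption, not taken from the paper: P_0 and P_s denote the system's power at the point of failure without and with safety measures, and H is a harm function that does not decrease as power grows.

```latex
% Illustrative formalization; all symbols are assumptions introduced here.
% (A1) Failure is inevitable: it occurs with probability 1 in both scenarios.
% (A2) Harm grows with power at failure: H is non-decreasing.
% (A3) Safety measures delay failure to a higher power level: P_s \ge P_0.
\begin{align*}
  \mathbb{E}[U_{\text{safety}}] - \mathbb{E}[U_{\text{no safety}}]
    &= -\mathbb{E}[H(P_s)] + \mathbb{E}[H(P_0)] \\
    &= -\bigl(\mathbb{E}[H(P_s)] - \mathbb{E}[H(P_0)]\bigr) \\
    &\le 0 \qquad \text{by (A2) and (A3).}
\end{align*}
```

This sketch counts only the harm at failure and ignores any benefits accrued before it. Under these assumptions the inequality is strict whenever safety measures raise the power at failure into a region where H is strictly increasing.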

AI Safety: A Climb To Armageddon?

This paper explores the pressing issue of risk assessment in Large Language
Models (LLMs) as they become increasingly prevalent in various applications.
Focusing on how reward models, which are designed to fine-tune pretrained LLMs
to align with human values, perceive and categorize different types of risks,
we delve into the challenges posed by the subjective nature of preference-based
training data. By utilizing the Anthropic Red-team dataset, we analyze major
risk categories, including Information Hazards, Malicious Uses, and
Discrimination/Hateful content. Our findings indicate that LLMs tend to
consider Information Hazards less harmful, a finding confirmed by a specially
developed regression model. Additionally, our analysis shows that LLMs respond
less stringently to Information Hazards compared to other risks. The study
further reveals a significant vulnerability of LLMs to jailbreaking attacks in
Information Hazard scenarios, highlighting a critical security concern in LLM
risk assessment and emphasizing the need for improved AI safety measures.

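This paper examines risk assessment in Large Language Models (LLMs), focusing on the challenges reward models face in perceiving and categorizing different types of risk. Using the Anthropic Red-team dataset, it analyzes major risk categories, including Information Hazards, Malicious Uses, and Discrimination/Hateful content. The results indicate that LLMs tend to consider Information Hazards less harmful, a finding confirmed by a specially developed regression model. The study also shows that LLMs respond less stringently in Information Hazard scenarios, highlighting a critical security concern in LLM risk assessment and the need for improved AI safety measures.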