This paper explores the pressing issue of risk assessment in Large Language
Models (LLMs) as they become increasingly prevalent in various applications.
Focusing on how reward models, which are used to fine-tune pretrained LLMs
toward human values, perceive and categorize different types of risk, we
examine the challenges posed by the subjective nature of preference-based
training data. Using the Anthropic Red-team dataset, we analyze major
risk categories, including Information Hazards, Malicious Uses, and
Discrimination/Hateful content. Our findings indicate that LLMs tend to
consider Information Hazards less harmful than the other categories, a result
confirmed by a regression model developed for this purpose. Additionally, our
analysis shows that LLMs respond less stringently to Information Hazards than
to other risks. The study
further reveals a significant vulnerability of LLMs to jailbreaking attacks in
Information Hazard scenarios, highlighting a critical security concern in LLM
risk assessment and emphasizing the need for improved AI safety measures.

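This paper examines risk assessment in large language models (LLMs), focusing
on the challenges reward models face in perceiving and categorizing different
types of risk. Using the Anthropic Red-team dataset, it analyzes major risk
categories, including Information Hazards, Malicious Uses, and
Discrimination/Hateful content. The results indicate that LLMs tend to regard
Information Hazards as less harmful, which a specially developed regression
model confirms. The study also finds that LLMs respond less stringently to
risk in Information Hazard scenarios, underscoring a key safety concern in LLM
risk assessment and the need for improved AI safety measures.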