Many studies have demonstrated that large language models (LLMs) can produce
harmful responses, exposing users to unexpected risks when LLMs are deployed.
Previous studies have proposed comprehensive taxonomies of the risks posed by
LLMs, as well as corresponding prompts that can be used to examine the safety
mechanisms of LLMs. However, this work has focused almost exclusively on
English, and other languages remain largely unexplored. Here we aim to bridge this
gap. We first introduce a dataset for the safety evaluation of Chinese LLMs,
and then extend it to two other scenarios that can be used to better identify
false negative and false positive examples in terms of risky prompt rejections.
We further present a set of fine-grained safety assessment criteria for each
risk type, facilitating both manual annotation and automatic evaluation of
LLM response harmfulness. Our experiments on five LLMs show that
region-specific risks are the most prevalent type of risk and the main
issue for all the Chinese LLMs we evaluated. Warning: this paper contains
example data that may be offensive, harmful, or biased.

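We introduce a dataset for evaluating the safety of Chinese LLMs and extend it
to two other scenarios that help better identify false negative and false
positive examples of risky prompt rejection. We also propose fine-grained
safety assessment criteria for each risk type, supporting both manual
annotation and automatic evaluation of LLM response harmfulness. Our
experiments on five LLMs show that region-specific risks are the most prevalent
type of risk and the main issue for all the Chinese LLMs we studied.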