In this paper, we establish a benchmark named HalluQA (Chinese Hallucination
Question-Answering) to measure the hallucination phenomenon in Chinese large
language models. HalluQA contains 450 meticulously designed adversarial
questions, spanning multiple domains, and takes into account Chinese historical
culture, customs, and social phenomena. During the construction of HalluQA, we
consider two types of hallucinations: imitative falsehoods and factual errors,
and we construct adversarial samples based on GLM-130B and ChatGPT. For
evaluation, we design an automated evaluation method using GPT-4 to judge
whether a model output is hallucinated. We conduct extensive experiments on 24
large language models, including ERNIE-Bot, Baichuan2, ChatGLM, Qwen, and
SparkDesk. Out of the 24 models, 18 achieved non-hallucination rates lower than
50%. This indicates that HalluQA is highly challenging. We analyze the primary
types of hallucinations in different types of models and their causes.
Additionally, we discuss which types of hallucinations should be prioritized
for different types of models.
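
As an illustration of the headline metric, the sketch below computes a non-hallucination rate from per-answer judge verdicts and lists the models that fall below 50%. It assumes the rate is the fraction of answers the judge marks as not hallucinated; the function names, model names, and verdicts are hypothetical and are not taken from the HalluQA code or results.

```python
from typing import Dict, List


def non_hallucination_rate(verdicts: List[bool]) -> float:
    """Fraction of answers judged NOT hallucinated (True = not hallucinated)."""
    if not verdicts:
        raise ValueError("no verdicts to score")
    return sum(verdicts) / len(verdicts)


def models_below(rates: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Names of models whose rate is strictly below the threshold, sorted."""
    return sorted(name for name, rate in rates.items() if rate < threshold)


# Hypothetical verdicts, standing in for the output of an automated judge.
toy_verdicts = {
    "model_a": [True, False, False, True],
    "model_b": [True, True, True, False],
    "model_c": [False, False, True, False],
}

rates = {name: non_hallucination_rate(v) for name, v in toy_verdicts.items()}
print(rates)
print(models_below(rates))
```

In the benchmark itself, the verdicts would come from the GPT-4 judge over the 450 questions rather than from hand-written lists.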
