Large language models (LLMs), such as ChatGPT, are prone to generating
hallucinations, i.e., content that conflicts with the source or cannot be
verified against factual knowledge. To understand what types of content LLMs
are apt to hallucinate, and to what extent, we introduce the Hallucination
Evaluation for Large Language Models (HELMA) benchmark, a large collection of
generated and human-annotated hallucinated samples for evaluating the
performance of LLMs in recognizing and alleviating hallucination. To generate
these samples, we propose a ChatGPT-based two-step framework, i.e.,
sampling-then-filtering. Specifically, we first adopt two different sampling
methods to generate hallucinated samples based on instructions, and then use an
example-enhanced filtering method to select the best one. Furthermore, we
hire human labelers to annotate the hallucinations in ChatGPT responses.
The empirical results suggest that ChatGPT generates hallucinations with
non-negligible probability and that existing LLMs face great challenges in
recognizing hallucinations in text. In addition, recognition performance can be improved by
providing external knowledge or adding reasoning steps. Our benchmark can be
accessed at this https URL

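This study introduces the Hallucination Evaluation for Large Language Models (HELMA) benchmark to evaluate the hallucination behavior of LLMs, and proposes a ChatGPT-based sampling-then-filtering framework to generate a large-scale, human-annotated hallucination dataset. It finds that ChatGPT has a fairly high probability of generating hallucinations and that existing LLMs face great challenges in recognizing hallucinations in text, although performance can be improved by providing external knowledge or adding reasoning steps.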
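
To make the two-step sampling-then-filtering idea concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. The names `Sampler`, `Selector`, and `sample_then_filter` are hypothetical, and the abstract does not specify how the two sampling methods or the example-enhanced filter are prompted, so both are modeled here as plain callables that would, in practice, wrap calls to ChatGPT.

```python
from typing import Callable, Sequence

# Hypothetical type aliases (assumptions, not taken from the paper):
# a sampler maps an instruction to one hallucinated candidate, and a
# selector returns the index of the candidate it judges best.
Sampler = Callable[[str], str]
Selector = Callable[[str, Sequence[str]], int]


def sample_then_filter(
    instruction: str,
    samplers: Sequence[Sampler],
    select: Selector,
) -> str:
    """Generate one candidate per sampler, then keep the selected one."""
    if not samplers:
        raise ValueError("at least one sampler is required")

    # Step 1 (sampling): each sampling method produces a candidate.
    candidates = [sampler(instruction) for sampler in samplers]

    # Step 2 (filtering): the selector picks a single candidate by index.
    chosen = select(instruction, candidates)
    if not 0 <= chosen < len(candidates):
        raise ValueError("selector returned an out-of-range index")
    return candidates[chosen]
```

Keeping the samplers and the selector as injected callables means the control flow can be exercised with simple stand-in functions before any model-backed versions are plugged in.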