Large language models (LMs) are prone to generating diverse factually incorrect statements, commonly called hallucinations. Current approaches predominantly focus on coarse-grained automatic hallucination detection or editing, overlooking nuanced error levels. In this paper, we propose a novel task, automatic fine-grained hallucination detection, and present a comprehensive taxonomy encompassing six hierarchically defined types of hallucination. To facilitate evaluation, we introduce a new benchmark that includes fine-grained human judgments on the outputs of two LMs across various domains. Our analysis reveals that ChatGPT and Llama 2-Chat exhibit hallucinations in 60% and 75% of their outputs, respectively, and that a majority of these hallucinations fall into categories that have been underexplored. As an initial step to address this, we train FAVA, a retrieval-augmented LM, on carefully designed synthetic data to detect and correct fine-grained hallucinations. On our benchmark, both automatic and human evaluations show that FAVA outperforms ChatGPT on fine-grained hallucination detection by a large margin, though substantial room for improvement remains. FAVA's suggested edits also improve the factuality of LM-generated text, resulting in 5-10% FActScore improvements.

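Large language models tend to generate diverse factually inaccurate statements. This paper proposes a new task, automatic fine-grained hallucination detection, and presents a comprehensive taxonomy covering six hierarchically defined hallucination types. Using a newly introduced evaluation benchmark, our analysis shows that 60% of ChatGPT outputs and 75% of Llama 2-Chat outputs contain hallucinations, and that most of these hallucinations fall into underexplored categories. As an initial step toward addressing this, we train FAVA, a retrieval-augmented language model that detects and corrects fine-grained hallucinations, using carefully designed synthetic data generation. On our benchmark, automatic and human evaluations show that FAVA clearly outperforms ChatGPT at fine-grained hallucination detection, although substantial room for improvement remains. FAVA's suggested edits also improve the factuality of LM-generated text, yielding 5-10% FActScore improvements.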

Fine-grained Hallucination Detection and Editing for Language Models