State-of-the-art language models (LMs) sometimes generate non-factual hallucinations that conflict with world knowledge. Despite extensive efforts to detect and mitigate hallucinations, their internal mechanisms remain poorly understood. Our study investigates the mechanistic causes of hallucination, specifically non-factual ones where the LM incorrectly predicts object attributes in response to subject-relation queries. Using causal mediation analysis and embedding space projection, we identify two general mechanistic causes of hallucinations shared across LMs of various scales and designs: 1) insufficient subject attribute knowledge in lower-layer MLPs, and 2) failure to select the correct object attribute in upper-layer attention heads and MLPs. These two mechanisms exhibit varying degrees of subject-object association, predictive uncertainty, and perturbation robustness. Additionally, we scrutinize LM pre-training checkpoints, revealing distinct learning dynamics for the two mechanistic causes of hallucinations. We also show how attribution features from our causal analysis can be used to construct effective hallucination detectors. Our work offers a mechanistic understanding of LM factual errors.

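Our study explores the mechanistic causes of non-factual hallucinations in language models (LMs). Through causal mediation analysis and embedding space projection, we identify two general mechanistic causes: 1) insufficient subject attribute knowledge in lower-layer MLPs, and 2) failure to select the correct object attribute in upper-layer attention heads and MLPs. By examining LM pre-training checkpoints, we reveal distinct learning dynamics for these two mechanistic causes of hallucination, and we show that attribution features derived from the causal analysis can be used to build effective hallucination detectors. Our work provides a mechanistic understanding of LM factual errors.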
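
As a rough illustration of what embedding space projection involves, the sketch below maps an intermediate hidden state to vocabulary logits through an unembedding matrix and reports how highly a given attribute token is ranked. It is a toy under stated assumptions, not the study's implementation: the function names `project_to_vocab` and `token_rank`, the hand-written 4-token vocabulary, and the omission of the final layer normalization are all illustrative choices.

```python
import numpy as np


def project_to_vocab(hidden: np.ndarray, unembed: np.ndarray) -> np.ndarray:
    """Project a hidden state of shape (d,) to logits of shape (V,).

    `unembed` is a hypothetical (V, d) unembedding matrix; a real LM would
    usually apply its final layer norm first, which this sketch omits.
    """
    return unembed @ hidden


def token_rank(logits: np.ndarray, token_id: int) -> int:
    """Return the 0-based rank of `token_id` (0 means highest logit)."""
    return int(np.sum(logits > logits[token_id]))


# Toy unembedding matrix: 4 vocabulary tokens, hidden size 3 (made-up values).
unembed = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
])

# Toy hidden state standing in for an intermediate-layer representation.
hidden = np.array([0.2, 1.5, -0.3])

logits = project_to_vocab(hidden, unembed)
for token_id in range(len(logits)):
    print(token_id, token_rank(logits, token_id))
```

A low rank for the correct object attribute at a given layer would suggest that the relevant information is already readable from that layer's representation, while a high rank would suggest it is not.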

A Mechanistic Study of Non-Factual Hallucinations in Language Models