While many capabilities of language models (LMs) improve with increased training budget, the influence of scale on hallucinations is not yet fully understood. Hallucinations come in many forms, and there is no universally accepted definition. We thus focus on studying only those hallucinations where a correct answer appears verbatim in the training set. To fully control the training data content, we construct a knowledge graph (KG)-based dataset and use it to train a set of increasingly large LMs. We find that for a fixed dataset, larger and longer-trained LMs hallucinate less. However, hallucinating on $\leq 5\%$ of the training data requires a model an order of magnitude larger, and thus an order of magnitude more compute, than Hoffmann et al. (2022) reported was optimal. Given this costliness, we study how hallucination detectors depend on scale. While increasing detector size improves performance on a fixed LM's outputs, we find an inverse relationship between the scale of the LM and the detectability of its hallucinations.

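This study addresses hallucination in language models, in particular hallucinations that occur even though the correct answer is present in the training set. By constructing a knowledge graph dataset and training language models of different scales, it finds that larger models and longer training reduce the hallucination rate, but that reaching a low hallucination rate requires significantly larger models and more compute. The study also reveals an inverse relationship between the scale of a language model and the detectability of its hallucinations.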

Training Language Models on the Knowledge Graph: Insights on Hallucinations and Their Detectability