Large Language Models (LLMs) have transformed the Natural Language Processing (NLP) landscape with their remarkable ability to understand and generate human-like text. However, these models are prone to ``hallucinations'' -- outputs that do not align with factual reality or the input context. This paper introduces the Hallucinations Leaderboard, an open initiative to quantitatively measure and compare the tendency of each model to produce hallucinations. The leaderboard uses a comprehensive set of benchmarks focusing on different aspects of hallucinations, such as factuality and faithfulness, across various tasks, including question-answering, summarisation, and reading comprehension. Our analysis provides insights into the performance of different models, guiding researchers and practitioners in choosing the most reliable models for their applications.

该论文介绍了幻觉排行榜，一个旨在定量衡量和比较每个模型产生幻觉倾向的开放性倡议，通过一系列综合评估模型的基准测试，如准确性和忠实度等方面，涵盖了问答、摘要和阅读理解等不同任务，为研究人员和实践者指导选择最可靠的模型。

幻觉排行榜-量化大型语言模型中的幻觉