Hallucinations pose a significant challenge to the reliability of large
language models (LLMs) in critical domains. Recent benchmarks designed to
assess LLM hallucinations within conventional NLP tasks, such as
knowledge-intensive question answering (QA) and summarization, are insufficient
for capturing the complexities of user-LLM interactions in dynamic, real-world
settings. To address this gap, we introduce HaluEval-Wild, the first benchmark
specifically designed to evaluate LLM hallucinations in the wild. We
meticulously collect challenging (adversarially filtered by Alpaca) user
queries from existing real-world user-LLM interaction datasets, including
ShareGPT, to evaluate the hallucination rates of various LLMs. Upon analyzing
the collected queries, we categorize them into five distinct types, which
enables a fine-grained analysis of the types of hallucinations LLMs exhibit.
We also synthesize reference answers using the powerful GPT-4 model and
retrieval-augmented generation (RAG). Our benchmark offers a new way to
understand and improve LLM reliability in scenarios that reflect real-world
interactions.

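To evaluate the tendency of large language models (LLMs) to hallucinate in
dynamic, real-world settings, we introduce HaluEval-Wild, a benchmark
specifically designed to evaluate hallucinations. By collecting challenging
user queries from existing user-LLM interaction datasets and synthesizing
reference answers with the powerful GPT-4 model and retrieval-augmented
generation (RAG), we conduct a fine-grained analysis of the hallucinations
LLMs produce, offering a new approach to improving LLM reliability.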