Hallucinations pose a significant challenge to the reliability and alignment
of Large Language Models (LLMs), limiting their widespread acceptance beyond
chatbot applications. Despite ongoing mitigation efforts, they remain
prevalent. Detecting hallucinations is itself a formidable task that
frequently requires manual labeling or constrained evaluations. This paper
introduces an automated, scalable framework that
combines benchmarking LLMs' hallucination tendencies with efficient
hallucination detection. We use LLMs to generate challenging tasks about
hypothetical phenomena and then employ LLMs as agents for efficient
hallucination detection. The framework is domain-agnostic, allowing the use of
any language model for benchmark creation or evaluation in any domain. We
introduce the publicly available HypoTermQA Benchmarking Dataset, on which
state-of-the-art models' performance ranged between 3% and 11%, and evaluator
agents demonstrated a 6% error rate in hallucination prediction. The proposed
framework provides opportunities to test and improve LLMs. Additionally, it has
the potential to generate benchmarking datasets tailored to specific domains,
such as law, health, and finance.

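Summary: The paper introduces an automated, scalable framework that combines benchmarking the hallucination tendencies of large language models (LLMs) with efficient hallucination detection. It provides opportunities to test and improve LLMs and has the potential to generate domain-specific benchmarking datasets.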