A frequently observed problem with LLMs is their tendency to generate output that is nonsensical, illogical, or factually incorrect, often referred to broadly as hallucination. Building on the recently proposed HalluciGen task for hallucination detection and generation, we evaluate a suite of open-access LLMs on their ability to detect intrinsic hallucinations in two conditional generation tasks: translation and paraphrasing. We study how model performance varies across tasks and languages, and we investigate the impact of model size, instruction tuning, and prompt choice. We find that performance varies across models but is consistent across prompts. Finally, we find that NLI models perform comparably well, suggesting that LLM-based detectors are not the only viable option for this specific task.

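This study examines the hallucination phenomenon common in large language models (LLMs) and evaluates their ability to detect intrinsic hallucinations in paraphrasing and translation tasks. By analyzing how different models perform across tasks and languages, the study finds that model performance varies across tasks but is consistent under specific prompts, and that natural language inference models perform equally well, indicating that LLM-based detection methods are not the only option.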

Can large language models detect intrinsic hallucinations in paraphrasing and machine translation?