Factual inconsistency with source documents in automatically generated summaries can lead to misinformation or pose risks. Existing factual consistency(FC) metrics are constrained by their performance, efficiency, and explainability. Recent advances in Large language models (LLMs) have demonstrated remarkable potential in text evaluation but their effectiveness in assessing FC in summarisation remains underexplored. Prior research has mostly focused on proprietary LLMs, leaving essential factors that affect their assessment capabilities unexplored. Additionally, current FC evaluation benchmarks are restricted to news articles, casting doubt on the generality of the FC methods tested on them. In this paper, we first address the gap by introducing TreatFact a dataset of LLM-generated summaries of clinical texts, annotated for FC by domain experts. Moreover, we benchmark 11 LLMs for FC evaluation across news and clinical domains and analyse the impact of model size, prompts, pre-training and fine-tuning data. Our findings reveal that despite proprietary models prevailing on the task, open-source LLMs lag behind. Nevertheless, there is potential for enhancing the performance of open-source LLMs through increasing model size, expanding pre-training data, and developing well-curated fine-tuning data. Experiments on TreatFact suggest that both previous methods and LLM-based evaluators are unable to capture factual inconsistencies in clinical summaries, posing a new challenge for FC evaluation.

自动产生的摘要与源文件的实际不一致可能导致错误信息或存在风险。现有的实际一致性（FC）指标受性能、效率和可解释性的限制。大型语言模型（LLM）的最新进展在文本评估方面表现出了显著的潜力，但其在总结中评估FC的效果尚未充分探索。本文首先通过引入TreatFact数据集来填补这一空白，该数据集包含由领域专家进行FC注释的LLM生成的临床文本摘要。此外，我们在新闻和临床领域对11个LLM进行了FC评估，并分析了模型大小、提示、预训练和微调数据的影响。研究发现，尽管专有模型在任务上占主导地位，但开源LLM仍然落后。然而，通过增加模型大小、扩展预训练数据和开发精心策划的微调数据，有潜力提升开源LLM的性能。在TreatFact上的实验表明，先前的方法和基于LLM的评估器都无法捕捉到临床摘要中的实际不一致性，给FC评估提出了新的挑战。

在大语言模型时代的摘要一致性评估