Natural Language Processing and Generation systems have recently shown the potential to complement and streamline the costly and time-consuming job of professional fact-checkers. In this work, we lift several constraints of current state-of-the-art pipelines for automated fact-checking based on the Retrieval-Augmented Generation (RAG) paradigm. Our goal is to benchmark, under more realistic scenarios, RAG-based methods for generating verdicts (i.e., short texts discussing the veracity of a claim), evaluating them on stylistically complex claims and on heterogeneous, yet reliable, knowledge bases. Our findings show a complex landscape. For example, LLM-based retrievers outperform other retrieval techniques, though they still struggle with heterogeneous knowledge bases. Larger models excel in verdict faithfulness, while smaller models provide better context adherence. Human evaluations favour zero-shot and one-shot approaches for informativeness, and fine-tuned models for emotional alignment.

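This study addresses an important problem in automated fact-checking: it proposes an evaluation approach based on the Retrieval-Augmented Generation (RAG) paradigm and benchmarks it under more realistic scenarios. The findings show that although large language models (LLMs) perform well on verdict faithfulness, they still struggle to handle different types of knowledge bases, which points to room for improvement in future model design.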

Face the Facts! Evaluating RAG-based Fact-checking Pipelines in Realistic Settings