The increasing threat of disinformation calls for automating parts of the
fact-checking pipeline. Identifying text segments requiring fact-checking is
known as claim detection (CD) and claim check-worthiness detection (CW), the
latter incorporating complex domain-specific criteria of worthiness and often
framed as a ranking task. Zero- and few-shot LLM prompting is an attractive
option for both tasks, as it bypasses the need for labeled datasets and allows
verbalized claim and worthiness criteria to be directly used for prompting. We
evaluate the LLMs' predictive and calibration accuracy on five CD/CW datasets
from diverse domains, each utilizing a different worthiness criterion. We
investigate two key aspects: (1) how best to distill factuality and worthiness
criteria into a prompt and (2) what amount of context to provide for each
claim. To this end, we experiment with varying the level of prompt verbosity
and the amount of contextual information provided to the model. Our results
show that optimal prompt verbosity is domain-dependent, adding context does not
improve performance, and confidence scores can be directly used to produce
reliable check-worthiness rankings.
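
As an illustration of the last finding, the following is a minimal sketch of how per-sentence confidence scores could be turned into a check-worthiness ranking and then scored. It is not the evaluation code used in this work. The function names, the choice of average precision as the ranking metric, and the binned calibration error are all assumptions made for the example; the confidence is taken to be the model's probability that a sentence is check-worthy.

```python
from typing import List, Sequence, Tuple


def rank_by_confidence(
    sentences: Sequence[str], confidences: Sequence[float]
) -> List[Tuple[str, float]]:
    """Sort sentences from highest to lowest confidence (hypothetical helper)."""
    return sorted(zip(sentences, confidences), key=lambda pair: pair[1], reverse=True)


def average_precision(labels_in_rank_order: Sequence[int]) -> float:
    """Average precision of a ranked list of binary labels (1 = check-worthy)."""
    hits = 0
    total = 0.0
    for rank, label in enumerate(labels_in_rank_order, start=1):
        if label:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits if hits else 0.0


def binned_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Weighted mean gap between predicted probability and observed positive rate.

    Assumption: probs[i] is the predicted probability of the positive class.
    """
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls into the last bin
        bins[idx].append((p, y))
    n = len(probs)
    error = 0.0
    for members in bins:
        if not members:
            continue
        mean_prob = sum(p for p, _ in members) / len(members)
        positive_rate = sum(y for _, y in members) / len(members)
        error += len(members) / n * abs(mean_prob - positive_rate)
    return error


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    sents = ["s1", "s2", "s3", "s4"]
    conf = [0.3, 0.9, 0.1, 0.8]
    gold = {"s1": 1, "s2": 1, "s3": 0, "s4": 0}
    ranked = rank_by_confidence(sents, conf)
    print(ranked)
    print(average_precision([gold[s] for s, _ in ranked]))
    print(binned_calibration_error(conf, [gold[s] for s in sents]))
```

A ranking metric such as average precision rewards placing check-worthy sentences near the top, while the calibration error measures whether the confidence values themselves can be trusted as probabilities; the two can diverge, which is why both are worth checking.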

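Using zero- and few-shot LLM prompting, in which factuality and worthiness criteria are used directly in the prompt, we evaluate LLMs' predictive and calibration accuracy on five claim detection and check-worthiness detection datasets from different domains. We find that the optimal prompt verbosity is domain-dependent, that providing context does not improve performance, and that confidence scores can be used directly to produce reliable rankings.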

Claim Check-Worthiness Detection: How Well do LLMs Grasp Annotation Guidelines?

An important component of an automated fact-checking system is the claim
check-worthiness detection system, which ranks sentences by how much they
need to be checked. Despite a body of research tackling the task, previous
work has overlooked the challenging nature of identifying check-worthy
claims across different topics. In this paper, we assess and
quantify the challenge of detecting check-worthy claims for new, unseen topics.
After highlighting the problem, we propose the AraCWA model to mitigate the
performance deterioration when detecting check-worthy claims across topics. The
AraCWA model boosts performance on new topics by incorporating two components,
one for few-shot learning and one for data augmentation. Using a publicly
available dataset of Arabic tweets covering 14 different topics, we
demonstrate that our proposed data augmentation strategy achieves substantial
overall improvements, although the extent of the improvement varies
across topics. Further, we analyse the semantic similarities between topics,
suggesting that the similarity metric could be used as a proxy to determine the
difficulty level of an unseen topic prior to undertaking the task of labelling
the underlying sentences.
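
To make the idea of a topic similarity proxy concrete, here is a minimal sketch that compares an unseen topic with previously labelled topics. The abstract does not state which similarity metric is used, so this example swaps in a simple bag-of-words cosine similarity as a stand-in; the function names and the toy texts are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def topic_vector(texts: Iterable[str]) -> Counter:
    """Aggregate whitespace-token counts over all texts of one topic."""
    counts: Counter = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_to_seen_topics(
    unseen: Iterable[str], seen: Dict[str, Iterable[str]]
) -> Dict[str, float]:
    """Similarity of an unseen topic to each already labelled topic."""
    unseen_vec = topic_vector(unseen)
    return {
        name: cosine_similarity(unseen_vec, topic_vector(texts))
        for name, texts in seen.items()
    }


if __name__ == "__main__":
    # Toy texts, invented for illustration only.
    seen_topics = {
        "health": ["vaccine trial results announced", "new vaccine doses arrive"],
        "sports": ["the team won the final match"],
    }
    new_topic = ["vaccine side effects reported"]
    print(similarity_to_seen_topics(new_topic, seen_topics))
```

Under this sketch, a low maximum similarity to all seen topics would flag the new topic as potentially hard before any labelling is done. In practice a sentence-embedding model suited to Arabic would likely replace the bag-of-words vectors.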

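This paper assesses and quantifies the challenge of identifying check-worthy claims across different topics and proposes the AraCWA model to mitigate the performance drop when detecting check-worthy claims across topics. The model improves performance on new topics through few-shot learning and data augmentation. Using a publicly available dataset of Arabic tweets, the paper shows that the proposed data augmentation strategy yields substantial improvements across topics.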