In the rapidly evolving landscape of Natural Language Processing (NLP), the use of Large Language Models (LLMs) for automated text annotation in social media posts has garnered significant interest. Despite the impressive innovations in developing LLMs like ChatGPT, their efficacy, and accuracy as annotation tools are not well understood. In this paper, we analyze the performance of eight open-source and proprietary LLMs for annotating the stance expressed in social media posts, benchmarking their performance against human annotators' (i.e., crowd-sourced) judgments. Additionally, we investigate the conditions under which LLMs are likely to disagree with human judgment. A significant finding of our study is that the explicitness of text expressing a stance plays a critical role in how faithfully LLMs' stance judgments match humans'. We argue that LLMs perform well when human annotators do, and when LLMs fail, it often corresponds to situations in which human annotators struggle to reach an agreement. We conclude with recommendations for a comprehensive approach that combines the precision of human expertise with the scalability of LLM predictions. This study highlights the importance of improving the accuracy and comprehensiveness of automated stance detection, aiming to advance these technologies for more efficient and unbiased analysis of social media.

分析了使用大型语言模型在社交媒体中自动进行文本标注的效果和准确性，并探讨了人工标注者与模型的判断差异，发现模型通常在人工标注者很难达成一致意见的情况下表现不佳，对于进一步提高自动立场检测的准确性和全面性，建议综合运用人工专业知识和模型预测的方法。

推进社交媒体帖子立场注释：对大型语言模型和众包的比较分析