In this research short, we examine the potential of using GPT-4o, a state-of-the-art large language model (LLM), to undertake evidence synthesis and systematic assessment tasks. Traditional workflows for such tasks involve large groups of domain experts who manually review and synthesize vast amounts of literature. The exponential growth of scientific literature and recent advances in LLMs provide an opportunity to complement these traditional workflows with new tools. We assess the efficacy of GPT-4o on these tasks using a sample from the dataset created by the Global Adaptation Mapping Initiative (GAMI), checking the accuracy with which it extracts climate change adaptation features from the scientific literature across three levels of expertise. Our results indicate that while GPT-4o can achieve high accuracy in low-expertise tasks such as geographic location identification, its performance in intermediate- and high-expertise tasks, such as stakeholder identification and assessment of the depth of the adaptation response, is less reliable. The findings motivate the need to design assessment workflows that use the strengths of models like GPT-4o while also providing refinements to improve their performance on these tasks.

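We study the potential of using GPT-4o, an advanced large language model (LLM), for evidence synthesis and systematic assessment tasks. We evaluate the effectiveness of GPT-4o in performing these tasks on the Global Adaptation Mapping Initiative (GAMI) dataset. The results show that GPT-4o can achieve high accuracy in low-expertise tasks such as geographic location identification, but its performance is not reliable in intermediate- and high-expertise tasks such as stakeholder identification and assessment of the depth of the adaptation response. These findings motivate the need to design assessment workflows that draw on the strengths of models like GPT-4o while also providing refinements to improve their performance on these tasks.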

Evaluating the Effectiveness of GPT-4o in Climate Change Evidence Synthesis and Systematic Assessment: Preliminary Insights