The use of large language models (LLMs) is expanding rapidly, and open-source models are becoming widely available, offering users safer and more adaptable options. Because they can be run locally, these models let users protect data privacy by removing the need to send data to third parties, and they can be customized for specific tasks. In this study, we compare the performance of several language models on the Sustainable Development Goal (SDG) mapping task, using the output of GPT-4o as the baseline. The open-source models selected for comparison are Mixtral, LLaMA 2, LLaMA 3, Gemma, and Qwen2. GPT-4o-mini, a smaller, lower-cost variant of GPT-4o, was also included to extend the comparison. Because SDG mapping is a multi-label task, we evaluate the models with micro-averaged F1 score, precision, and recall, all of which are derived from the confusion matrix and capture different aspects of performance. We analyze each model by plotting its F1 score, precision, and recall curves across different thresholds. The results show that LLaMA 2 and Gemma still have significant room for improvement, while the other four models (Mixtral, LLaMA 3, Qwen2, and GPT-4o-mini) do not differ greatly in performance. The outputs from all seven models are available on Zenodo: https://doi.org/10.5281/zenodo.12789375.
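
As an illustration of the evaluation described above, the following minimal sketch computes micro-averaged precision, recall, and F1 for a multi-label task at several thresholds. It is not the study's own code: the function names, the toy data, and the assumption that each model assigns a score to every candidate SDG (so that a threshold can be applied) are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def labels_at_threshold(scores: Dict[int, float], threshold: float) -> Set[int]:
    """Keep the SDG labels whose (assumed) score reaches the threshold."""
    return {sdg for sdg, score in scores.items() if score >= threshold}


def micro_prf(
    reference: List[Set[int]], predicted: List[Set[int]]
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over all documents and labels.

    Micro-averaging pools true positives, false positives, and false
    negatives across every (document, label) pair before computing the ratios.
    """
    tp = fp = fn = 0
    for ref, pred in zip(reference, predicted):
        tp += len(ref & pred)   # labels present in both sets
        fp += len(pred - ref)   # predicted but not in the baseline
        fn += len(ref - pred)   # in the baseline but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


# Toy data (hypothetical): baseline SDG labels per document, standing in
# for the GPT-4o output, and per-label scores from a candidate model.
baseline = [{1, 2}, {3}, {7, 13}]
candidate_scores = [
    {1: 0.9, 2: 0.4, 5: 0.6},
    {3: 0.8},
    {13: 0.7, 7: 0.3, 12: 0.5},
]

# One (precision, recall, F1) triple per threshold; these are the points
# that would be plotted as curves.
curve = {}
for threshold in (0.3, 0.5, 0.8):
    predicted = [labels_at_threshold(s, threshold) for s in candidate_scores]
    curve[threshold] = micro_prf(baseline, predicted)
    p, r, f = curve[threshold]
    print(f"threshold={threshold:.1f} precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Raising the threshold generally trades recall for precision, which is why plotting all three metrics across thresholds gives a fuller picture than a single operating point.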

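This study compares the performance of large language models (LLMs) on the Sustainable Development Goal (SDG) mapping task, filling a gap in the evaluation of multiple open-source models in this area. It analyzes model performance in depth using several evaluation metrics (F1 score, precision, and recall). The results show that LLaMA 2 and Gemma still need significant improvement, while the other models differ little in performance. The study provides empirical evidence for choosing a suitable language model and supports further progress in SDG mapping.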

Evaluating the Performance of Large Language Models in Sustainable Development Goal Mapping