BriefGPT.xyz
Aug, 2024
Evaluating Large Language Models on Spatial Tasks: A Multi-Task Benchmarking Study
Liuchang Xu, Shuo Zhao, Qingming Lin, Luyao Chen, Qianqian Luo...
TL;DR
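This study addresses a gap in evaluating how large language models perform on spatial tasks. It introduces a new multi-task spatial evaluation dataset and uses it to systematically examine and compare several advanced models. The study finds that gpt-4o achieves the best overall accuracy, and that specific prompting strategies significantly improve model performance on particular tasks.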
Abstract
The advent of large language models such as ChatGPT, Gemini, and others has underscored the importance of evaluating their diverse capabilities, ranging from natural language understanding to code generation. However, their performance on