Automated Short Answer Grading (ASAG) has been an active area of machine-learning research for over a decade. It promises to let educators grade and give feedback on free-form responses in large-enrollment courses despite the limited availability of human graders. Over the years, carefully trained models have achieved increasingly high levels of performance. More recently, pre-trained Large Language Models (LLMs) have emerged as a commodity, and an intriguing question is how a general-purpose tool without additional training compares to specialized models. We studied the performance of GPT-4 on the standard 2-way and 3-way benchmark datasets SciEntsBank and Beetle; in addition to the standard task of grading the alignment of the student answer with a reference answer, we also investigated withholding the reference answer. We found that, overall, the performance of the pre-trained general-purpose GPT-4 LLM is comparable to that of hand-engineered models but worse than that of pre-trained LLMs that had received specialized training.

Performance of the Pre-Trained Large Language Model GPT-4 on Automated Short Answer Grading