This paper reports on a series of experiments with a novel dataset
evaluating how well Large Language Models (LLMs) can mark (i.e. grade) open
text responses to short answer questions. Specifically, we explore how well
different combinations of GPT version and prompt engineering strategies
performed at marking real student answers to short answer questions across
different domain areas (Science and History) and grade-levels (spanning ages
5-16) using a new, never-used-before dataset from Carousel, a quizzing
platform. We found that GPT-4 with basic few-shot prompting performed well
(Kappa, 0.70) and, importantly, very close to human-level performance (0.75).
This research builds
on prior findings that GPT-4 could reliably score short answer reading
comprehension questions at a performance-level very close to that of expert
human raters. The proximity to human-level performance across a variety of
subjects and grade levels suggests that LLMs could be a valuable tool for
supporting low-stakes formative assessment tasks in K-12 education, with
important implications for real-world education delivery.
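
The Kappa values quoted above measure agreement between two raters after
correcting for agreement expected by chance. The abstract does not say which
Kappa variant was used, so the sketch below assumes unweighted Cohen's kappa.
The function name `cohen_kappa` and the example labels are hypothetical and are
not taken from the Carousel dataset.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: probability that both raters pick the same label
    # if each labelled independently according to their own label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical marks (1 = correct, 0 = incorrect) for eight student answers.
human_marks = [1, 1, 0, 1, 0, 0, 1, 0]
model_marks = [1, 1, 0, 0, 0, 1, 1, 0]
kappa = cohen_kappa(human_marks, model_marks)
print(f"kappa = {kappa:.2f}")
```

Under this reading, the reported 0.70 would be agreement between the model and
the reference marks, and 0.75 the corresponding figure for human markers.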

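This paper discusses experiments on using large language models (LLMs) to mark open-text responses to short answer questions. It examines how different combinations of GPT version and prompt engineering strategy perform when marking real student answers, and finds that GPT-4 performs well, close to human level. The research has important implications for supporting low-stakes formative assessment tasks in K-12 education.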

Can Large Language Models Make the Grade? An Empirical Study Evaluating LLMs Ability to Mark Short Answer Questions in K-12 Education

Grading short answer questions automatically with interpretable reasoning
behind the grading decision is a challenging goal for current transformer
approaches. Justification cue detection, in combination with logical reasoners,
has shown a promising direction for neuro-symbolic architectures in ASAG. However,
one of the main challenges is the requirement of annotated justification cues
in the students' responses, which only exist for a few ASAG datasets. To
overcome this challenge, we contribute (1) a weakly supervised annotation
procedure for justification cues in ASAG datasets, and (2) a neuro-symbolic
model for explainable ASAG based on justification cues. Our approach improves
upon the RMSE by 0.24 to 0.3 compared to the state-of-the-art on the Short
Answer Feedback dataset in a bilingual, multi-domain, and multi-question
training setup. This result shows that our approach provides a promising
direction for generating high-quality grades and accompanying explanations for
future research in ASAG and educational NLP.
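
For readers unfamiliar with the metric, RMSE (root mean squared error) is the
square root of the mean squared difference between predicted and gold scores,
so lower values are better. The sketch below is a minimal illustration only:
the function name `rmse` and the example scores are hypothetical and do not
come from the Short Answer Feedback dataset.

```python
import math


def rmse(predicted, gold):
    """Root mean squared error between two equal-length score sequences."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("predicted and gold must be non-empty and the same length")
    squared_errors = [(p - g) ** 2 for p, g in zip(predicted, gold)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical scores in [0, 1] for four student responses.
gold_scores = [1.0, 0.5, 0.0, 1.0]
predicted_scores = [0.5, 0.5, 0.5, 1.0]
error = rmse(predicted_scores, gold_scores)
print(f"RMSE = {error:.3f}")
```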

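Automatically grading short answer questions while explaining the grading decision is a challenging goal for current transformer approaches. In ASAG, detecting justification cues and combining them with logical reasoning has shown promise, but one of the main challenges is that it requires annotated justification cues in the students' responses, and such annotations exist for only a few ASAG datasets. To address this, we propose (1) a weakly supervised annotation procedure for justification cues in ASAG datasets, and (2) a neuro-symbolic model for explainable ASAG based on justification cues. In a bilingual, multi-domain, multi-question training setup, our approach improves RMSE by 0.24 to 0.3 compared to the state of the art. This result shows that our approach offers a promising direction for future research in ASAG and educational NLP, capable of generating high-quality grades and accompanying explanations.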

Enhancing Multi-Domain Automatic Short Answer Grading through an Explainable Neuro-Symbolic Pipeline

We investigate the effectiveness of ensembles of pretrained transformer-based
language models on short answer questions using the Kaggle Automated Short
Answer Scoring dataset. We fine-tune a collection of popular small, base, and
large pretrained transformer-based language models, and train one feature-based
model on the dataset with the aim of testing ensembles of these models. We used
an early stopping mechanism and hyperparameter optimization in training. We
observe that larger models generally perform slightly better; however,
they still fall short of state-of-the-art results on their own. Once we
consider ensembles of models, some ensembles of several large networks
do produce state-of-the-art results; however, these ensembles are too
large to realistically be put in a production environment.
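
The abstract does not state how the individual models' predictions are
combined. One common option, assumed here purely for illustration, is soft
voting: average each model's class probabilities and pick the class with the
highest mean. The function name `soft_vote` and the probabilities below are
hypothetical.

```python
def soft_vote(per_model_probs):
    """Average class probabilities across models and return (label, averages).

    per_model_probs: one probability vector per model, all for the same answer
    and all over the same ordered set of score classes.
    """
    if not per_model_probs:
        raise ValueError("at least one model prediction is required")
    n_models = len(per_model_probs)
    n_classes = len(per_model_probs[0])
    if any(len(probs) != n_classes for probs in per_model_probs):
        raise ValueError("all models must predict over the same classes")

    averaged = [
        sum(probs[c] for probs in per_model_probs) / n_models
        for c in range(n_classes)
    ]
    # The predicted label is the index of the highest averaged probability.
    label = max(range(n_classes), key=averaged.__getitem__)
    return label, averaged


# Hypothetical probabilities over score classes 0, 1, 2 from three models.
model_probs = [
    [0.1, 0.6, 0.3],
    [0.2, 0.3, 0.5],
    [0.1, 0.5, 0.4],
]
label, averaged = soft_vote(model_probs)
print(label, [round(p, 3) for p in averaged])
```

Every member model must be run for each answer at inference time, which is
the production cost the abstract points to for ensembles of large networks.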

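This study uses the Kaggle Automated Short Answer Scoring dataset to investigate the effectiveness of ensembles built by fine-tuning a range of small, base, and large pretrained transformer language models and training one feature-based model. Larger models generally perform slightly better, but they still fall short of state-of-the-art results on their own. Only ensembles of many networks reach state-of-the-art results, and these ensembles are too large to be used in a real production environment.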