Grading of exams is an important, labor-intensive, subjective, repetitive, and
frequently challenging task. The feasibility of autograding textual responses
has greatly increased thanks to the availability of large language models
(LLMs) such as ChatGPT and because of the substantial influx of data brought
about by digitalization. However, entrusting AI models with decision-making
roles raises ethical considerations, mainly stemming from potential biases and
issues related to generating false information. Thus, in this manuscript we
provide an evaluation of a large language model for the purpose of autograding,
while also highlighting how LLMs can support educators in validating their
grading procedures. Our evaluation targets automatic short textual answer
grading (ASAG), spanning various languages and examinations from two distinct
courses. Our findings suggest that while "out-of-the-box" LLMs are a valuable
tool for offering a complementary perspective, they are not yet ready for
independent automated grading and still require human oversight.

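By evaluating the feasibility of a large language model for autograding and highlighting how LLMs can support educators in validating their grading procedures, the study shows that while "out-of-the-box" LLMs are a valuable tool for offering a complementary perspective, they are not yet ready for independent automated grading and still require human oversight.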

Towards LLM-based Autograding for Short Textual Answers

Autograding short textual answers has become much more feasible due to the
rise of NLP and the increased availability of question-answer pairs brought
about by a shift to online education. Autograding performance is still inferior
to human grading. The statistical and black-box nature of state-of-the-art
machine learning models makes them untrustworthy, raising ethical concerns and
limiting their practical utility. Furthermore, the evaluation of autograding is
typically confined to small, monolingual datasets for a specific question type.
This study uses a large dataset of about 10 million question-answer pairs in
multiple languages, covering diverse fields such as math and language and
exhibiting strong variation in question and answer syntax. We demonstrate the
effectiveness of fine-tuning transformer models for autograding on such complex
datasets. Our best hyperparameter-tuned model yields an accuracy of about
86.5%, comparable to state-of-the-art models that are less general and more
tuned to a specific type of question, subject, and language. More
importantly, we address trust and ethical concerns. By involving humans in the
autograding process, we show how to improve the accuracy of automatically
graded answers, achieving accuracy equivalent to that of teaching assistants.
We also show how teachers can effectively control the type of errors made by
the system and how they can efficiently validate that the autograder's
performance on individual exams is close to the expected performance.
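
The abstract does not say how humans are brought into the grading loop. One common pattern, assumed here purely for illustration and not taken from the paper, is to accept the model's grade only when its confidence clears a threshold and to send everything else to a human grader. The `Prediction` type, the function names, and the 0.9 threshold below are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """One autograder output (hypothetical structure, for illustration only)."""
    answer_id: str
    label: str         # predicted grade, e.g. "correct" or "incorrect"
    confidence: float  # model probability for the predicted label, in [0, 1]


def route_predictions(predictions, threshold=0.9):
    """Split predictions into auto-accepted ones and ones deferred to a human.

    The threshold is an assumed tuning knob: raising it defers more answers
    to humans and usually raises accuracy on the auto-accepted set.
    """
    auto, review = [], []
    for p in predictions:
        (auto if p.confidence >= threshold else review).append(p)
    return auto, review


def accuracy(predictions, gold):
    """Fraction of predictions whose label matches gold[answer_id]."""
    if not predictions:
        return 0.0
    hits = sum(1 for p in predictions if gold[p.answer_id] == p.label)
    return hits / len(predictions)
```

Under these assumptions, a teacher could grade a small random sample of the auto-accepted answers by hand and compare `accuracy` on that sample with the expected figure, which is one way to check the autograder on an individual exam.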

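This study uses a large multilingual dataset of 10 million question-answer pairs to show that fine-tuning transformer models can be applied to autograding on complex datasets, and it discusses trust and ethical concerns in grading. By involving humans in the autograding process, we show how to improve the accuracy of automatically graded answers, reaching accuracy equivalent to that of teaching assistants. We also propose an effective way for teachers to control the type of errors the system makes and to efficiently validate that the autograder's performance on individual exams is close to the expected performance.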