Open-ended short-answer questions (SAGs) have been widely recognized as a powerful tool for providing deeper insights into learners' responses in the context of learning analytics (LA). In practice, however, SAGs are challenging to use because of the high grading workload and concerns about inconsistent assessment. With recent advances in natural language processing (NLP), automatic short-answer grading (ASAG) offers a promising solution to these challenges. Even so, current ASAG algorithms often generalize poorly and tend to be tailored to specific questions. In this paper, we propose a unified multi-agent ASAG framework, GradeOpt, which leverages large language models (LLMs) as graders for SAGs. More importantly, GradeOpt incorporates two additional LLM-based agents, the reflector and the refiner, into the multi-agent system. This enables GradeOpt to automatically optimize the original grading guidelines by performing self-reflection on its errors. In experiments on a challenging ASAG task, namely grading pedagogical content knowledge (PCK) and content knowledge (CK) questions, GradeOpt achieves higher grading accuracy and closer behavioral alignment with human graders than representative baselines. Finally, comprehensive ablation studies confirm the effectiveness of the individual components of GradeOpt.
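
The grade, reflect, refine cycle described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names (`optimize_guideline`, `toy_grader`, `toy_reflector`, `toy_refiner`), the keyword-based stand-ins that replace the LLM agents, the toy data, and the rule of keeping a refined guideline only when accuracy on the labeled examples improves. None of this is taken from the GradeOpt implementation.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, int]       # (student answer, human-assigned score)
Error = Tuple[str, int, int]    # (student answer, human score, predicted score)


def accuracy(grader: Callable[[str, str], int], guideline: str,
             data: List[Example]) -> float:
    """Fraction of answers where the grader matches the human score."""
    return sum(grader(guideline, a) == y for a, y in data) / len(data)


def optimize_guideline(guideline: str, data: List[Example],
                       grader: Callable[[str, str], int],
                       reflector: Callable[[str, List[Error]], str],
                       refiner: Callable[[str, str], str],
                       rounds: int = 3) -> Tuple[str, float]:
    """Hypothetical loop: grade, reflect on errors, refine the guideline."""
    best, best_acc = guideline, accuracy(grader, guideline, data)
    for _ in range(rounds):
        # Collect the answers the current guideline grades incorrectly.
        errors = [(a, y, grader(best, a)) for a, y in data
                  if grader(best, a) != y]
        if not errors:
            break
        feedback = reflector(best, errors)      # explain what went wrong
        candidate = refiner(best, feedback)     # rewrite the guideline
        cand_acc = accuracy(grader, candidate, data)
        # Assumed acceptance rule: keep the candidate only if it helps.
        if cand_acc > best_acc:
            best, best_acc = candidate, cand_acc
    return best, best_acc


# Deterministic stand-ins for the three agents (real agents would be LLMs).
def toy_grader(guideline: str, answer: str) -> int:
    keywords = [k.strip() for k in guideline.split(";") if k.strip()]
    return int(any(k in answer.lower() for k in keywords))


def toy_reflector(guideline: str, errors: List[Error]) -> str:
    # Suggest the first word of the first answer that was under-scored.
    for answer, human, predicted in errors:
        if human == 1 and predicted == 0:
            return answer.lower().split()[0]
    return ""


def toy_refiner(guideline: str, feedback: str) -> str:
    return f"{guideline}; {feedback}" if feedback else guideline


data = [
    ("fractions share a common denominator", 1),
    ("equivalent ratios are compared", 1),
    ("I do not know", 0),
]
best, acc = optimize_guideline("denominator", data,
                               toy_grader, toy_reflector, toy_refiner)
print(best, acc)
```

Passing the three agents in as plain callables keeps the loop independent of any particular model or prompt, so the toy functions could in principle be swapped for LLM-backed ones without changing `optimize_guideline`.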

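This study addresses the heavy grading workload and inconsistent assessment associated with open-ended short-answer questions (SAGs) in learning analytics by proposing GradeOpt, a unified multi-agent automatic short-answer grading framework. The framework uses large language models (LLMs) and introduces two LLM-based agents, the reflector and the refiner, to optimize the grading guidelines through self-reflection. In grading experiments on pedagogical content knowledge (PCK) and content knowledge (CK) questions, it shows higher grading accuracy and closer alignment with human graders' behavior than existing baselines.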

An LLM-Based Automatic Grading Framework with Human-Level Guideline Optimization