Evaluating open-ended written examination responses from students is an
essential yet time-intensive task for educators, requiring a high degree of
effort, consistency, and precision. Recent developments in Large Language
Models (LLMs) present a promising opportunity to balance the need for thorough
evaluation with efficient use of educators' time. In this study, we examine how
effectively four LLMs (ChatGPT-3.5, ChatGPT-4, Claude-3, and Mistral-Large)
assess university students' open-ended answers to questions about reference
material they have studied. Each model was instructed to evaluate 54 answers
repeatedly under two conditions: 10 times with a temperature setting of 0.0
and 10 times with a temperature of 0.5, for an expected total of 1,080
evaluations per model and 4,320 evaluations across all models. A
Retrieval-Augmented Generation (RAG) framework was used to have the LLMs
carry out the evaluation of the answers. As of spring 2024, our
analysis revealed notable variations in both the consistency and the grading
outcomes of the studied LLMs. The strengths and weaknesses of LLMs in
evaluating open-ended written responses in educational settings need to be
better understood. Further comparative research is essential to determine the
accuracy and cost-effectiveness of using LLMs for educational assessments.

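Evaluating open-ended written examination answers is an important task for educators that demands considerable effort, consistency, and accuracy. This study explores how effectively large language models evaluate university students' answers to open-ended questions about reference material, and finds notable differences in the consistency and grading outcomes of the LLMs. Further comparative research is essential to determine the accuracy and cost-effectiveness of using LLMs for educational assessment.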
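
The evaluation counts stated in the abstract follow from the design of 4 models, 2 temperature settings, 54 answers, and 10 repetitions per condition. The following is a minimal sketch that enumerates that grid and checks the totals. The constant and function names are illustrative assumptions, not the study's actual code, and no model is called.

```python
from itertools import product

# Design parameters taken from the abstract; the names are illustrative only.
MODELS = ["ChatGPT-3.5", "ChatGPT-4", "Claude-3", "Mistral-Large"]
TEMPERATURES = [0.0, 0.5]
N_ANSWERS = 54
N_REPEATS = 10


def evaluation_grid():
    """Yield one (model, temperature, answer_index, repeat_index) tuple
    for every planned evaluation in the design."""
    yield from product(MODELS, TEMPERATURES, range(N_ANSWERS), range(N_REPEATS))


grid = list(evaluation_grid())

# Count the planned evaluations for a single model, then across all models.
per_model = sum(1 for model, *_ in grid if model == MODELS[0])
print(per_model, len(grid))  # 1080 4320
```

Enumerating the grid explicitly, rather than only multiplying the factors, makes it straightforward to compare the planned evaluations against the ones a model actually returned, since the abstract speaks of an expected total.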