Automatically assessing classroom discussion quality is becoming increasingly feasible thanks to recent NLP advances such as large language models (LLMs). In this work, we examine how the assessment performance of two LLMs interacts with three factors that may affect it: task formulation, context length, and few-shot examples. We also explore the computational efficiency and predictive consistency of the two LLMs. Our results suggest that all three factors affect the performance of the tested LLMs and that consistency and performance are related. We recommend an LLM-based assessment approach that offers a good balance of predictive performance, computational efficiency, and consistency.

Analyzing the Application of Large Language Models in Classroom Discussion Assessment