This study examined whether ChatGPT's large language models could match the scoring accuracy of the human and machine scores from the ASAP competition. The investigation covered several prediction models, including linear regression, random forest, gradient boost, and boost. ChatGPT's performance was evaluated against human raters using the quadratic weighted kappa (QWK) metric. Results indicated that while ChatGPT's gradient boost model achieved QWKs close to those of human raters for some data sets, its overall performance was inconsistent and often lower than that of the human raters. The study highlighted the need for further refinement, particularly in handling biases and ensuring scoring fairness. Despite these challenges, ChatGPT demonstrated potential for scoring efficiency, especially with domain-specific fine-tuning. The study concludes that ChatGPT can complement human scoring but requires additional development before it is reliable enough for high-stakes assessments. Future research should improve model accuracy, address ethical considerations, and explore hybrid models combining ChatGPT with empirical methods.
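For readers unfamiliar with the agreement metric named above, the following is a minimal sketch of how quadratic weighted kappa can be computed for two sets of integer scores. It is not the study's implementation; the function name `quadratic_weighted_kappa` and its parameters are assumptions chosen for illustration, and the score range is assumed to be known in advance.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """Illustrative QWK for two equal-length lists of integer scores.

    Hypothetical helper, not taken from the study. Returns 1.0 for
    perfect agreement and values near 0 for chance-level agreement.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("score lists must be non-empty and of equal length")
    n = max_score - min_score + 1
    if n < 2:
        raise ValueError("the score scale needs at least two categories")
    total = len(rater_a)

    # Observed confusion matrix: rows are rater A, columns are rater B.
    observed = [[0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_score][b - min_score] += 1

    # Marginal score counts for each rater.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    numerator = 0.0    # weighted observed disagreement
    denominator = 0.0  # weighted disagreement expected by chance
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2
            expected = hist_a[i] * hist_b[j] / total
            numerator += weight * observed[i][j]
            denominator += weight * expected

    if denominator == 0.0:
        # Both raters gave one identical score to every response.
        return 1.0
    return 1.0 - numerator / denominator


human = [1, 2, 3, 4]
model = [1, 2, 3, 4]
print(quadratic_weighted_kappa(human, model, 1, 4))
```

The quadratic weights penalize a disagreement by the square of its distance on the score scale, so a model that is off by two points is penalized four times as heavily as one that is off by one.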

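This study aimed to assess whether ChatGPT could match the accuracy of human and machine scoring, specifically on the ASAP competition data. It found that although some of ChatGPT's models produced scores close to human scores on particular data sets, overall performance was unstable and often below that of human scoring, which points to areas needing improvement, especially in eliminating bias and ensuring scoring fairness. Even so, ChatGPT showed potential for scoring efficiency, particularly with domain-specific fine-tuning. The study recommends that future work improve model accuracy, address ethical issues, and explore hybrid models that combine ChatGPT with empirical methods.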

Using ChatGPT to Score Essays and Short-Form Constructed Responses