Generative AI and large language models hold great promise for enhancing computing education by powering next-generation educational technologies for introductory programming. Recent works have studied these models for different scenarios relevant to programming education; however, these works are limited because they typically consider already outdated models or only specific scenarios. Consequently, there is a lack of a systematic study that benchmarks state-of-the-art models for a comprehensive set of programming education scenarios. In our work, we systematically evaluate two models, ChatGPT (based on GPT-3.5) and GPT-4, and compare their performance with human tutors for a variety of scenarios. We evaluate using five introductory Python programming problems and real-world buggy programs from an online platform, and assess performance using expert-based annotations. Our results show that GPT-4 drastically outperforms ChatGPT (based on GPT-3.5) and comes close to human tutors' performance for several scenarios. These results also highlight settings where GPT-4 still struggles, providing exciting future directions for developing techniques to improve the performance of these models.

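This study systematically evaluates two models, ChatGPT (based on GPT-3.5) and GPT-4, and compares them with human tutors across a variety of scenarios. We evaluate using five Python programming problems and real-world buggy programs from an online platform, and assess performance using expert-based annotations. The results show that GPT-4 clearly outperforms ChatGPT and comes close to human tutors' performance in some scenarios, but still performs poorly in others.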

Generative AI for Programming Education: Benchmarking ChatGPT, GPT-4, and Human Tutors