This study investigates the consistency of feedback ratings generated by OpenAI's GPT-4, a state-of-the-art artificial intelligence language model, across multiple iterations, time spans and stylistic variations. The model rated responses to tasks within the Higher Education (HE) subject domain of macroeconomics in terms of their content and style. Statistical analysis was conducted to assess interrater reliability, the consistency of the ratings across iterations, and the correlation between content and style ratings. The results revealed high interrater reliability, with intraclass correlation coefficient (ICC) scores ranging between 0.94 and 0.99 for different time spans, suggesting that GPT-4 is capable of generating consistent ratings across repetitions when given a clear prompt. Style and content ratings show a high correlation of 0.87. When responses were written in an inadequate style, the average content ratings remained constant while the style ratings decreased, which indicates that the large language model (LLM) effectively distinguishes between these two criteria during evaluation. The prompt used in this study is also presented and explained. Further research is necessary to assess the robustness and reliability of AI models in various use cases.
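
As a rough illustration of the two statistics reported above, the following sketch computes an ICC and a Pearson correlation from a small rating matrix. It is not the study's analysis code. The abstract does not state which ICC form was used, so the two-way random-effects, absolute-agreement, single-rating form, ICC(2,1), is an assumption here. The function names `icc2_1` and `pearson` and all rating values are hypothetical.

```python
from math import sqrt
from statistics import mean


def icc2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rating.

    ratings[i][j] is the score given to response i in rating iteration j.
    Assumption: this ICC form is chosen for illustration only.
    """
    n = len(ratings)        # number of rated responses (rows)
    k = len(ratings[0])     # number of rating iterations (columns)
    grand = mean(x for row in ratings for x in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]

    # Two-way ANOVA sums of squares
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    # Mean squares
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical data: 5 responses, each rated in 3 iterations (made-up values).
demo_content = [[8, 8, 7], [5, 5, 5], [9, 9, 9], [3, 4, 3], [6, 6, 7]]
demo_style = [[7, 7, 7], [5, 4, 5], [9, 8, 9], [4, 4, 3], [6, 7, 6]]

print("ICC(2,1), content ratings:", round(icc2_1(demo_content), 3))
print(
    "Pearson r, mean content vs. mean style:",
    round(pearson([mean(r) for r in demo_content],
                  [mean(r) for r in demo_style]), 3),
)
```

Each row is one rated response and each column one repetition of the same prompt, so an ICC close to 1 means the repeated ratings barely differ relative to the differences between responses.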

Is GPT-4 reliable in terms of consistency when rating text?