While Large Language Models (LLMs) excel at the Winograd Schema Challenge (WSC), a coreference resolution task testing common-sense reasoning through pronoun disambiguation, they struggle with instances that feature minor alterations or rewording. To address this, we introduce EvoGrad, an open-source platform that harnesses a human-in-the-loop approach to create a dynamic dataset tailored to such altered WSC instances. Leveraging ChatGPT's capabilities, we expand our task instances from 182 to 3,691, setting a new benchmark for diverse common-sense reasoning datasets. Additionally, we introduce the error depth metric, which assesses model stability in dynamic tasks. Our results emphasize the challenge posed by EvoGrad: even the best-performing LLM, GPT-3.5, achieves an accuracy of 65.0% with an average error depth of 7.2, a stark contrast to human performance of 92.8% accuracy without perturbation errors. This highlights ongoing model limitations and the value of dynamic datasets in uncovering them.

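Using AI-assisted techniques, we introduce the EvoGrad platform and expand the number of Winograd Schema Challenge task instances from 182 to 3,691, setting a new benchmark for diverse common-sense reasoning datasets. By introducing the error depth metric, we assess model stability in dynamic tasks. Our results highlight the challenge posed by EvoGrad: even the best-performing large language model, GPT-3.5, reaches an accuracy of only 65.0% with an average error depth of 7.2, a significant gap from the human accuracy of 92.8%, which underscores the limitations of current models and the value of dynamic datasets.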

EvoGrad: A Dynamic Approach to the Winograd Schema Challenge Based on Human Adversaries