The Winograd Schema Challenge (WSC) serves as a prominent benchmark for evaluating machine understanding. While Large Language Models (LLMs) excel at answering WSC questions, their ability to generate such questions remains less explored. In this work, we propose Tree-of-Experts (ToE), a novel prompting method that improves the generation of WSC instances (50% valid cases vs. 10% in recent methods). Using this approach, we introduce WSC+, a novel dataset comprising 3,026 LLM-generated sentences. Notably, we extend the WSC framework by incorporating new 'ambiguous' and 'offensive' categories, providing deeper insight into model overconfidence and bias. Our analysis reveals nuances in generation-evaluation consistency, suggesting that LLMs do not always perform better when evaluating questions they generated themselves than when evaluating questions crafted by other models. On WSC+, GPT-4, the top-performing LLM, achieves an accuracy of 68.7%, significantly below the human benchmark of 95.1%.

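This work proposes Tree-of-Experts (ToE), a new prompting method that improves the generation of Winograd Schema Challenge questions, and introduces WSC+, a new dataset of 3,026 sentences generated by Large Language Models. By adding new 'ambiguous' and 'offensive' categories to the WSC framework, it offers deeper insight into model overconfidence and bias. The analysis reveals nuances in generation-evaluation consistency, showing that LLMs do not always perform well when evaluating questions they generated themselves, compared with questions generated by other models. On WSC+, GPT-4, the best LLM, reaches an accuracy of 68.7%, well below the human benchmark of 95.1%.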

WSC+: Enhancing the Winograd Schema Challenge Using Tree-of-Experts