The Winograd Schema Challenge (WSC) and variants inspired by it have become important benchmarks for common-sense reasoning (CSR). Model performance on the WSC has quickly progressed from chance-level to near-human using neural language models trained on massive corpora. In this paper, we analyze the effects of varying degrees of overlap between these training corpora and the test instances in WSC-style tasks. We find that a large number of test instances overlap considerably with the corpora on which state-of-the-art models are (pre)trained, and that a significant drop in classification accuracy occurs when we evaluate models on instances with minimal overlap. Based on these results, we develop the KnowRef-60K dataset, which consists of over 60k pronoun disambiguation problems scraped from web data. KnowRef-60K is the largest corpus to date for WSC-style common-sense reasoning and exhibits a significantly lower proportion of overlaps with current pretraining corpora.

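By studying how neural language models perform on the Winograd Schema Challenge, we find that overlap between the test instances and the corpora used to train these models has an important effect on classification accuracy. A high proportion of test instances overlap with existing training corpora, and model performance drops significantly on instances with minimal overlap. Based on these results, we construct the KnowRef-60K dataset, the largest corpus to date for Winograd Schema Challenge-style common-sense reasoning, which has a significantly lower proportion of overlap with current pretraining corpora.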
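
To make the notion of overlap between a test instance and a pretraining corpus concrete, the following is a minimal sketch of one possible measure: the fraction of an instance's word n-grams that also appear somewhere in the corpus. The abstract does not specify how overlap is computed, so the n-gram formulation, the whitespace tokenization, the default `n=3`, and the function names `ngrams` and `overlap_ratio` are all assumptions made for illustration.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase, whitespace-tokenized word n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(instance: str, corpus_docs: Iterable[str], n: int = 3) -> float:
    """Fraction of the instance's n-grams found in any corpus document.

    Hypothetical overlap measure for illustration only; not necessarily
    the measure used in the paper.
    """
    instance_ngrams = ngrams(instance, n)
    if not instance_ngrams:
        # Instance is shorter than n tokens, so no n-grams can overlap.
        return 0.0
    corpus_ngrams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_ngrams |= ngrams(doc, n)
    return len(instance_ngrams & corpus_ngrams) / len(instance_ngrams)


if __name__ == "__main__":
    test_instance = "The trophy does not fit in the suitcase"
    corpus = ["The trophy does not fit in the box", "An unrelated sentence"]
    print(overlap_ratio(test_instance, corpus, n=3))
```

Under such a measure, test instances could be bucketed by their overlap ratio and model accuracy reported per bucket, which is the kind of comparison the abstract describes for instances with minimal overlap. For corpora at pretraining scale, holding every n-gram in an in-memory set would not be practical, and an index or hashing scheme would be needed instead.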

An Analysis of Dataset Overlap on Winograd-Style Tasks