Despite the subjective nature of many NLP tasks, most NLU evaluations have
focused on using the majority label with presumably high agreement as the
ground truth. Less attention has been paid to the distribution of human
opinions. We collect ChaosNLI, a dataset with a total of 464,500 annotations to
study Collective HumAn OpinionS in oft-used NLI evaluation sets. This dataset
is created by collecting 100 annotations per example for 3,113 examples in SNLI
and MNLI and 1,532 examples in Abductive-NLI. Analysis reveals that: (1) high
human disagreement exists in a noticeable number of examples in these datasets;
(2) the state-of-the-art models lack the ability to recover the distribution
over human labels; (3) models achieve near-perfect accuracy on the subset of
data with a high level of human agreement, whereas they can barely beat a
random guess on the data with low levels of human agreement, which accounts for most
of the common errors made by state-of-the-art models on the evaluation sets.
This calls into question the validity of improving model performance on old metrics for
the low-agreement part of evaluation datasets. Hence, we argue for a detailed
examination of human agreement in future data collection efforts, and for
evaluating model outputs against the distribution over collective human
opinions. The ChaosNLI dataset and experimental scripts are available at
this https URL
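
To make the idea of evaluating against the distribution over human labels concrete, the following is a minimal sketch, not the authors' released scripts. It assumes a three-way NLI label set (as in SNLI and MNLI; Abductive-NLI would need a two-way set), uses Shannon entropy as one possible proxy for the level of human agreement, and uses Jensen-Shannon divergence as one possible measure of how far a model's output distribution is from the human one. All function names and the toy numbers are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

# Assumed three-way NLI label set; not taken from the dataset files.
LABELS = ("entailment", "neutral", "contradiction")


def label_distribution(annotations, labels=LABELS):
    """Turn a list of per-annotator labels into a probability distribution."""
    counts = Counter(annotations)
    total = len(annotations)
    return [counts[label] / total for label in labels]


def entropy(p):
    """Shannon entropy in bits; higher values mean lower human agreement."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute zero."""
    return sum(x * math.log2(x / y) for x, y in zip(p, q) if x > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits, bounded between 0 and 1."""
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Sanity check on the totals stated in the abstract:
# 100 annotations for each of 3,113 + 1,532 examples.
assert 100 * (3113 + 1532) == 464500

# Toy example (invented numbers): 100 annotators split over three labels.
annotations = ["entailment"] * 60 + ["neutral"] * 35 + ["contradiction"] * 5
human = label_distribution(annotations)

# Hypothetical model softmax output that is confident in the majority label.
model = [0.95, 0.04, 0.01]

print("human distribution:", human)
print("human entropy (bits):", entropy(human))
print("JS divergence model vs. human:", js_divergence(human, model))
```

In this toy case the model would be counted as correct under majority-label accuracy, yet its output distribution is much sharper than the human one, which is the kind of gap a distribution-based comparison is meant to expose.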

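Using the ChaosNLI dataset, this study finds that human judgments in NLI evaluation are highly subjective, that highly novel datasets cause existing models to perform poorly, and it proposes new evaluation metrics that take the distribution of human judgments into account.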