Recent work has found evidence of gender bias in machine translation and coreference resolution models, mostly using synthetic diagnostic datasets. While these datasets quantify bias in a controlled experiment, they often do so on a small scale and consist mostly of artificial, out-of-distribution sentences. In this work, we find grammatical patterns indicating stereotypical and non-stereotypical gender-role assignments (e.g., female nurses versus male dancers) in corpora from three domains, resulting in a first large-scale gender bias dataset of 108K diverse real-world English sentences. We manually verify the quality of our corpus and use it to evaluate gender bias in various coreference resolution and machine translation models. We find that all tested models tend to over-rely on gender stereotypes when presented with natural inputs, which may be especially harmful when deployed in commercial systems. Finally, we show that our dataset lends itself to finetuning a coreference resolution model, and find that this mitigates bias on a held-out set. Our dataset and models are publicly available at www.github.com/SLAB-NLP/BUG. We hope they will spur future research into gender bias evaluation and mitigation techniques in realistic settings.

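By searching for grammatical patterns, we find stereotypical and non-stereotypical gender-role assignments (e.g., female nurses versus male dancers) in corpora from three domains, and release the first large-scale gender bias dataset of 108K diverse English sentences. We use it to evaluate gender bias in various coreference resolution and machine translation models, and find that all tested models tend to over-rely on gender stereotypes when given natural inputs. Our dataset and models are publicly available at www.github.com/SLAB-NLP/BUG, and we hope they will advance future research on gender bias evaluation and mitigation techniques in realistic settings.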

Collecting a Large-Scale Gender Bias Dataset for Coreference Resolution and Machine Translation