A Preliminary Analysis of Undesirable Content in a Public Web-Crawl Corpus

May 2021

What's in the Box? An Analysis of Undesirable Content in the Common Crawl Corpus

Alexandra Luccioni, Joseph D. Viviano

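TL;DR: Much of the success of the current generation of neural language models is attributed to ever-larger training corpora, yet these corpora are rarely examined closely. This paper analyzes the Common Crawl and finds that, even after filtering procedures, it still contains a significant amount of undesirable content, including hate speech and sexually explicit content. The authors discuss the potential impact of this content on language models and conclude with future research directions and a more mindful approach to corpus collection and analysis.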

Abstract

Whereas much of the success of the current generation of neural language models has been driven by increasingly large training corpora, relatively little research has been dedicated to analyzing these massive sources of