Language models trained on large-scale unfiltered datasets curated from the
open web acquire systemic biases, prejudices, and harmful views from their
training data. We present a methodology for programmatically identifying and
removing harmful text from web-scale datasets. A pretrained language model is
used to calculate the log-likelihood of researcher-written trigger phrases
conditioned on a specific document, which is used to identify and filter
documents from the dataset. We demonstrate that models trained on this filtered
dataset exhibit lower propensity to generate harmful text, with a marginal
decrease in performance on standard language modeling benchmarks compared to
unfiltered baselines. We provide a partial explanation for this performance gap
by surfacing examples of hate speech and other undesirable content from
standard language modeling benchmarks. Finally, we discuss the generalization
of this method and how trigger phrases which reflect specific values can be
used by researchers to build language models which are more closely aligned
with their values.

提出一种从网页规模数据集中识别和过滤有害文本的方法，使用预训练语言模型计算特定文档条件下研究员编写的触发词组的对数似然，并根据该结果识别和过滤数据集中的文档，证明在过滤后的数据集上训练的语言模型产生有害文本的倾向更低，性能与未过滤基线相比略有降低，最后探讨了此方法的推广前景及其对语言模型值域的对齐性方面的作用。