Pretrained large language models have become indispensable for solving various natural language processing (NLP) tasks. However, safely deploying them in real-world applications is challenging because they can generate toxic content. To address this challenge, we propose two novel pretraining data augmentation strategies that significantly reduce model toxicity without compromising utility: (1) MEDA, which adds the raw toxicity score as metadata to the pretraining samples, and (2) INST, which adds instructions to those samples indicating their toxicity. Our results indicate that our best-performing strategy (INST) reduces the toxicity probability by up to 61% while preserving accuracy on five benchmark NLP tasks and improving AUC scores on four bias detection tasks by 1.3%. We also demonstrate the generalizability of our techniques by scaling the number of training samples and the number of model parameters.
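
The two augmentation strategies can be illustrated with a minimal sketch. The abstract does not specify the exact metadata format, the instruction wording, or how toxicity is binarized, so the `toxicity:` prefix, the instruction sentence, the 0.5 threshold, and the function names below are all assumptions made for illustration. The toxicity score is taken as a given input rather than computed.

```python
def augment_meda(text: str, toxicity: float) -> str:
    """MEDA-style augmentation: prepend the raw toxicity score as metadata.

    The "toxicity: <score>" prefix format is an assumption, not the paper's.
    """
    return f"toxicity: {toxicity:.2f}\n{text}"


def augment_inst(text: str, toxicity: float, threshold: float = 0.5) -> str:
    """INST-style augmentation: prepend an instruction describing toxicity.

    The instruction wording and the 0.5 threshold are assumptions.
    """
    label = "toxic" if toxicity >= threshold else "non-toxic"
    return f"This is a {label} post.\n{text}"


if __name__ == "__main__":
    # Hypothetical pretraining sample with a precomputed toxicity score.
    sample, score = "The weather is lovely today.", 0.03
    print(augment_meda(sample, score))
    print(augment_inst(sample, score))
```

Under these assumptions, a model pretrained on such samples could be steered at inference time by starting the prompt with the low-toxicity metadata or the non-toxic instruction.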

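This work proposes two new pretraining data augmentation strategies (MEDA and INST) that significantly reduce model toxicity without compromising utility. The best strategy (INST) lowers the toxicity probability by up to 61% while maintaining accuracy on five benchmark NLP tasks and raising AUC scores on four bias detection tasks by 1.3%. The techniques also generalize as the number of training samples and the number of model parameters are scaled up.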

Adding Instructions during Pretraining: Effective Way of Controlling Toxicity in Language Models