Recent improvements in large-scale language models have driven progress on automatic generation of syntactically and semantically consistent text for many real-world applications. Many of these advances leverage the availability of large corpora. While training on such corpora encourages the model to understand long-range dependencies in text, it can also result in the models internalizing the social biases present in the corpora. This paper aims to quantify and reduce biases exhibited by language models. Given a conditioning context (e.g. a writing prompt) and a language model, we analyze if (and how) the sentiment of the generated text is affected by changes in values of sensitive attributes (e.g. country names, occupations, genders, etc.) in the conditioning context, a.k.a. counterfactual evaluation. We quantify these biases by adapting individual and group fairness metrics from the fair machine learning literature. Extensive evaluation on two different corpora (news articles and Wikipedia) shows that state-of-the-art Transformer-based language models exhibit biases learned from data. We propose embedding-similarity and sentiment-similarity regularization methods that improve both individual and group fairness metrics without sacrificing perplexity and semantic similarity---a positive step toward development and deployment of fairer language models for real-world applications.

本文旨在量化并减少语言模型中表现出的情感偏见，该文分析了在给定的条件下（例如写作提示）和语言模型中，引起生成的文本情感发生变化的敏感属性（例如国家名称，职业，性别）的值变化的影响。我们采用公平机器学习文献中的个体和团体公正度量来量化情感偏见，并证明在两种不同的语料库（新闻文章和维基百科）上训练的大规模模型存在相当高的偏见。我们随后提出使用嵌入和情感预测导出的正则化方法，该方法应用于语言模型的潜在表示。该正则化提高了公正度量，同时保持了可比水平的困惑度和语义相似性。

通过反事实评估减少语言模型中的情感偏见