Language is a powerful vehicle for expressing societal belief systems, and in doing so it perpetuates the biases prevalent in society. Gender bias is one of the most pervasive of these biases and appears in both online and offline discourse. As LLMs approach human-like fluency in text generation, a nuanced understanding of the biases these systems can generate is imperative. Prior work often treats gender bias as a binary classification task. However, bias must be perceived on a relative scale, so we investigate the generation of bias of varying degrees and the receptivity of human annotators to it. Specifically, we create the first dataset of GPT-generated English text with normative ratings of gender bias. Ratings were obtained using Best-Worst Scaling, an efficient comparative annotation framework. Next, we systematically analyze how the themes of gender bias vary across the observed ranking and show that identity attack is most closely related to gender bias. Finally, we report the performance on our dataset of existing automated models trained on related concepts.
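To illustrate how comparative judgments can be turned into a relative scale, the following is a minimal sketch of the counting procedure commonly used with Best-Worst Scaling: each item's score is the fraction of times it was picked as "best" minus the fraction of times it was picked as "worst". Here "best" is assumed to mean most biased and "worst" least biased. The function name `bws_scores`, the tuple layout, and the sentence IDs are hypothetical, and the abstract does not state that this exact scoring rule was used.

```python
from collections import defaultdict


def bws_scores(annotations):
    """Compute count-based Best-Worst Scaling scores.

    annotations: iterable of (items, best, worst) tuples, where `items` is the
    tuple of item IDs shown together, `best` is the ID judged most biased and
    `worst` the ID judged least biased (assumed convention).
    Returns a dict mapping item ID to a score in [-1, 1].
    """
    appearances = defaultdict(int)
    best_counts = defaultdict(int)
    worst_counts = defaultdict(int)

    for items, best, worst in annotations:
        for item in items:
            appearances[item] += 1
        best_counts[best] += 1
        worst_counts[worst] += 1

    # Score = (#times chosen best - #times chosen worst) / #times shown.
    return {
        item: (best_counts[item] - worst_counts[item]) / shown
        for item, shown in appearances.items()
    }


# Toy data with made-up sentence IDs (not from the dataset).
toy_annotations = [
    (("s1", "s2", "s3", "s4"), "s1", "s4"),
    (("s1", "s2", "s3", "s5"), "s1", "s3"),
    (("s2", "s3", "s4", "s5"), "s2", "s4"),
]
scores = bws_scores(toy_annotations)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Sorting items by score gives a ranking from most to least biased under the assumed convention; in the toy data, `s1` ranks first and `s4` last.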

Normative Ratings of Gender Bias in GPT-Generated English Text