Algorithmic harms are commonly categorized as either allocative or
representational. This study specifically addresses the latter, focusing on an
examination of current definitions of representational harms to discern what is
included and what is not. This analysis motivates our expansion beyond
behavioral definitions to encompass harms to cognitive and affective states.
The paper outlines high-level requirements for measurement: identifying the
necessary expertise to implement this approach and illustrating it through a
case study. Our work highlights the unique vulnerabilities of large language
models to perpetrating representational harms, particularly when these harms go
unmeasured and unmitigated. The work concludes by presenting proposed
mitigations and delineating when to employ them. The overarching aim of this
research is to establish a framework for broadening the definition of
representational harms and to translate insights from fairness research into
practical measurement and mitigation praxis.

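This study aims to broaden the definition of representational harms and, by measuring and mitigating the harm that large language models cause to cognitive and affective states, to establish a practical framework for measurement and mitigation grounded in fairness research.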

Beyond Behaviorist Representational Harms: A Plan for Measurement and Mitigation

To recognize and mitigate harms from large language models (LLMs), we need to
understand the prevalence and nuances of stereotypes in LLM outputs. Toward
this end, we present Marked Personas, a prompt-based method to measure
stereotypes in LLMs for intersectional demographic groups without any lexicon
or data labeling. Grounded in the sociolinguistic concept of markedness (which
characterizes explicitly linguistically marked categories versus unmarked
defaults), our proposed method is twofold: 1) prompting an LLM to generate
personas, i.e., natural language descriptions, of the target demographic group
alongside personas of unmarked, default groups; 2) identifying the words that
significantly distinguish personas of the target group from corresponding
unmarked ones. We find that the portrayals generated by GPT-3.5 and GPT-4
contain higher rates of racial stereotypes than human-written portrayals using
the same prompts. The words distinguishing personas of marked (non-white,
non-male) groups reflect patterns of othering and exoticizing these
demographics. An intersectional lens further reveals tropes that dominate
portrayals of marginalized groups, such as tropicalism and the
hypersexualization of minoritized women. These representational harms have
concerning implications for downstream applications like story generation.
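
The second step of the method (finding words that significantly distinguish target-group personas from unmarked ones) can be illustrated with a minimal sketch. The abstract does not say which statistic is used, so the choice here is an assumption: a smoothed log-odds ratio with a z-score, in the style of weighted log-odds comparisons between two corpora. The function names, the pseudo-count `alpha`, and the threshold of 1.96 are all hypothetical choices for illustration, not the authors' implementation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep alphabetic tokens (a deliberately simple tokenizer).
    return re.findall(r"[a-z']+", text.lower())


def marked_words(target_texts, unmarked_texts, alpha=0.5, z_threshold=1.96):
    """Return (word, z) pairs over-represented in target_texts.

    Assumption: significance is judged by the z-score of a smoothed
    log-odds ratio; alpha is a uniform pseudo-count added to every word.
    """
    target = Counter(w for t in target_texts for w in tokenize(t))
    unmarked = Counter(w for t in unmarked_texts for w in tokenize(t))
    vocab = set(target) | set(unmarked)
    if len(vocab) < 2:
        return []  # log-odds are undefined with fewer than two word types

    n_t = sum(target.values())
    n_u = sum(unmarked.values())
    a0 = alpha * len(vocab)  # total pseudo-count mass

    scores = {}
    for w in vocab:
        y_t = target[w] + alpha
        y_u = unmarked[w] + alpha
        # Smoothed log-odds of the word within each set of personas.
        log_odds_t = math.log(y_t / (n_t + a0 - y_t))
        log_odds_u = math.log(y_u / (n_u + a0 - y_u))
        delta = log_odds_t - log_odds_u
        # Approximate variance of the log-odds difference.
        z = delta / math.sqrt(1.0 / y_t + 1.0 / y_u)
        if z > z_threshold:
            scores[w] = z
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

Calling `marked_words(target_personas, unmarked_personas)` on two lists of generated persona strings returns the words whose z-score exceeds the threshold, strongest first. No lexicon or labels are involved, which matches the abstract's framing, but the statistic itself remains an assumption of this sketch.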

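This paper presents Marked Personas, a prompt-based method that measures stereotypes in LLMs for intersectional demographic groups without any lexicon or data labeling. The results show that portrayals generated by GPT-3.5 and GPT-4 contain more racial stereotypes than human-written portrayals using the same prompts. Portrayals of marginalized groups also follow specific patterns, such as tropicalism and hypersexualization. These representational harms have concerning implications for downstream applications such as story generation.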

Marked Personas: Using Natural Language Prompts to Measure Stereotypes in Language Models

Large-scale Pre-Trained Language Models (PTLMs) capture knowledge from
massive human-written data, which contains latent societal biases and toxic
content. In this paper, we leverage the primary task of PTLMs, i.e., language
modeling, and propose a new metric to quantify manifested implicit
representational harms in PTLMs towards 13 marginalized demographics. Using
this metric, we conducted an empirical analysis of 24 widely used PTLMs. Our
analysis provides insights into the correlation between the proposed metric in
this work and other related metrics for representational harm. We observe that
our metric correlates with most of the gender-specific metrics in the
literature. Through extensive experiments, we explore the connections between
PTLM architectures and representational harms across two dimensions: depth and
width of the networks. We found that prioritizing depth over width mitigates
representational harms in some PTLMs. Our code and data can be found at
this https URL.
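
As an illustration of how agreement between two harm metrics across a set of models could be checked, here is a minimal sketch of a rank correlation. The abstract does not state which correlation statistic is used; Spearman's rank correlation is an assumption here, and the helper names are hypothetical. The inputs would be per-model scores from the two metrics, which this sketch does not supply.

```python
import math


def ranks(values):
    # 1-based ranks, with tied values sharing their average rank.
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    # Pearson correlation of two equal-length sequences.
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sd_x * sd_y)


def spearman(metric_a, metric_b):
    # Spearman's rho is the Pearson correlation of the ranks.
    return pearson(ranks(metric_a), ranks(metric_b))
```

A value near 1 would mean the two metrics order the models almost identically, while a value near 0 would mean the orderings are unrelated.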

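This paper proposes a language-modeling-based metric for quantifying implicit representational harms in pre-trained language models (PTLMs) toward 13 marginalized demographics, empirically analyzes 24 widely used PTLMs, and finds that prioritizing network depth over width mitigates representational harms in some PTLMs.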