This paper's primary goal is to provoke thoughtful discussion about the
relationship between bias and fundamental properties of large language models.
We do this by seeking to convince the reader that harmful biases are an
inevitable consequence arising from the design of any large language model as
LLMs are currently formulated. To the extent that this is true, it suggests
that the problem of harmful bias cannot be properly addressed without a serious
reconsideration of AI driven by LLMs, going back to the foundational
assumptions underlying their design.

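By examining the design of large language models (LLMs), this paper discusses the relationship between bias and LLMs and seeks to convince the reader that harmful bias is an inevitable consequence of how LLMs are currently designed. Addressing harmful bias therefore requires a serious reconsideration of LLM-driven AI, going back to the foundational assumptions underlying its design.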

Large Language Models are Biased Because They Are Large Language Models

Text-to-Image (TTI) generative models have shown great progress in the past
few years in terms of their ability to generate complex and high-quality
imagery. At the same time, these models have been shown to suffer from harmful
biases, including exaggerated societal biases (e.g., gender, ethnicity), as
well as incidental correlations that limit such models' ability to generate
more diverse imagery. In this paper, we propose a general approach to study and
quantify a broad spectrum of biases, for any TTI model and for any prompt,
using counterfactual reasoning. Unlike other works that evaluate generated
images on a predefined set of bias axes, our approach automatically identifies
potential biases that might be relevant to the given prompt, and measures those
biases. In addition, our paper extends quantitative scores with post-hoc
explanations in terms of semantic concepts in the images generated. We show
that our method is uniquely capable of explaining complex multi-dimensional
biases through semantic concepts, as well as the intersectionality between
different biases for any given prompt. We perform extensive user studies to
illustrate that the results of our method and analysis are consistent with
human judgements.

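We propose a general approach that uses counterfactual reasoning to study and quantify a broad spectrum of biases for any text-to-image generative model and any prompt, and that extends the quantitative scores with explanations in terms of semantic concepts.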

TIBET: Identifying and Evaluating Biases in Text-to-Image Generative Models

While vision-language models (VLMs) have achieved remarkable performance
improvements recently, there is growing evidence that these models also possess
harmful biases with respect to social attributes such as gender and race. Prior
studies have primarily focused on probing such bias attributes individually
while ignoring biases associated with intersections between social attributes.
This could be due to the difficulty of collecting an exhaustive set of
image-text pairs for various combinations of social attributes. To address this
challenge, we employ text-to-image diffusion models to produce counterfactual
examples for probing intersectional social biases at scale. Our approach
utilizes Stable Diffusion with cross attention control to produce sets of
counterfactual image-text pairs that are highly similar in their depiction of a
subject (e.g., a given occupation) while differing only in their depiction of
intersectional social attributes (e.g., race & gender). Through our
over-generate-then-filter methodology, we produce SocialCounterfactuals, a
high-quality dataset containing over 171k image-text pairs for probing
intersectional biases related to gender, race, and physical characteristics. We
conduct extensive experiments to demonstrate the usefulness of our generated
dataset for probing and mitigating intersectional social biases in
state-of-the-art VLMs.

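Uses text-to-image diffusion models to generate counterfactual examples at scale for probing and mitigating intersectional social biases in vision-language models.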

Probing and Mitigating Intersectional Social Biases in Vision-Language Models with Counterfactual Examples

Computer vision models have been known to encode harmful biases, leading to
the potentially unfair treatment of historically marginalized groups, such as
people of color. However, there remains a lack of datasets balanced along
demographic traits that can be used to evaluate the downstream fairness of
these models. In this work, we demonstrate that diffusion models can be
leveraged to create such a dataset. We first use a diffusion model to generate
a large set of images depicting various occupations. Subsequently, each image
is edited using inpainting to generate multiple variants, where each variant
refers to a different perceived race. Using this dataset, we benchmark several
vision-language models on a multi-class occupation classification task. We find
that images generated with non-Caucasian labels have a significantly higher
occupation misclassification rate than images generated with Caucasian labels,
and that several misclassifications are suggestive of racial biases. We measure
a model's downstream fairness by computing the standard deviation in the
probability of predicting the true occupation label across the different
perceived identity groups. Using this fairness metric, we find significant
disparities between the evaluated vision-and-language models. We hope that our
work demonstrates the potential value of diffusion methods for fairness
evaluations.
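
The fairness metric described above (the standard deviation, across perceived
identity groups, of the probability of predicting the true occupation label)
can be illustrated with a minimal Python sketch. This is not the authors'
implementation. The function names, the (group, probability) record layout,
the per-group averaging step, and the choice of population rather than sample
standard deviation are all assumptions made for illustration, and the numbers
in the usage lines are made up.

```python
from collections import defaultdict
from statistics import mean, pstdev


def mean_true_label_prob_by_group(records):
    """Average the probability assigned to the true label within each group.

    `records` is an iterable of (group, prob_true_label) pairs, where
    prob_true_label is the model's probability for the correct occupation.
    This record layout is hypothetical, not taken from the paper.
    """
    by_group = defaultdict(list)
    for group, prob in records:
        by_group[group].append(prob)
    return {group: mean(probs) for group, probs in by_group.items()}


def fairness_std(records):
    """Standard deviation of the per-group mean probabilities.

    Assumption: population standard deviation (pstdev). The abstract does
    not say whether the population or sample form is used.
    Lower values mean the model treats the groups more uniformly.
    """
    per_group = mean_true_label_prob_by_group(records)
    return pstdev(list(per_group.values()))


# Made-up example values, for illustration only.
example_records = [
    ("group_a", 0.8), ("group_a", 0.6),
    ("group_b", 0.5), ("group_b", 0.3),
]
example_score = fairness_std(example_records)
```

Under these assumptions, a score of zero means every group has the same
average probability of a correct occupation prediction, and larger scores
indicate larger disparities between groups.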

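Using diffusion models, we generate sets of occupation images with different perceived-race labels and find that images generated with non-Caucasian labels have a significantly higher occupation misclassification rate than images generated with Caucasian labels, and that some misclassifications are suggestive of racial bias. We measure a model's fairness by computing the standard deviation in the probability of predicting the true occupation label across identity groups. Using this fairness metric, we find significant disparities between the evaluated vision-and-language models. We hope our work demonstrates the potential value of diffusion methods for fairness evaluation.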