Large Language Models (LLMs), now used daily by millions of users, can encode
societal biases, exposing their users to representational harms. A large body
of scholarship on LLM bias exists, but it predominantly adopts a Western-centric
frame and attends comparatively less to bias levels and potential harms in the
Global South. In this paper, we quantify stereotypical bias in popular LLMs
according to an Indian-centric frame and compare bias levels between the Indian
and Western contexts. To do this, we develop a novel dataset,
Indian-BhED (Indian Bias Evaluation Dataset), containing stereotypical and
anti-stereotypical examples for caste and religion contexts. We find that the
majority of LLMs tested are strongly biased towards stereotypes in the Indian
context, especially as compared to the Western context. We finally investigate
Instruction Prompting as a simple intervention to mitigate such bias and find
that it significantly reduces both stereotypical and anti-stereotypical biases
in the majority of cases for GPT-3.5. The findings of this work highlight the
need for including more diverse voices when evaluating LLMs.
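
One common way to quantify a preference for stereotypes over paired examples is to count how often a model scores the stereotypical sentence above its anti-stereotypical counterpart. The following is a minimal sketch under that assumption, not the paper's actual evaluation procedure; the function name `stereotype_preference_rate` and the example scores are hypothetical.

```python
from typing import Sequence, Tuple


def stereotype_preference_rate(pairs: Sequence[Tuple[float, float]]) -> float:
    """Return the percentage of pairs in which the stereotypical sentence wins.

    Each pair is (score_stereotypical, score_anti_stereotypical), for example
    model log-likelihoods of the two sentences. Ties count as half a win.
    A value of 50.0 means no systematic preference either way.
    """
    if not pairs:
        raise ValueError("need at least one sentence pair")
    wins = 0.0
    for stereo_score, anti_score in pairs:
        if stereo_score > anti_score:
            wins += 1.0
        elif stereo_score == anti_score:
            wins += 0.5
    return 100.0 * wins / len(pairs)


# Hypothetical log-likelihoods, invented purely for illustration.
example_pairs = [(-12.1, -13.4), (-9.8, -9.2), (-15.0, -17.3), (-11.0, -11.0)]
rate = stereotype_preference_rate(example_pairs)
print(f"stereotypical sentence preferred in {rate:.1f}% of pairs")
```

Values well above 50 would indicate a lean towards stereotypes, and values well below 50 a lean towards anti-stereotypes.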

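This study finds that large language models often carry societal biases, which are stronger in the Indian context than in the Western one, and that a simple intervention called Instruction Prompting can significantly reduce such bias.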

Casteist but Not Racist? Quantifying Disparities in Large Language Model Bias between India and the West

Recent studies have demonstrated how to assess the stereotypical bias in
pre-trained English language models. In this work, we extend this branch of
research along several dimensions by systematically investigating (a)
mono- and multilingual models of (b) different underlying architectures with
respect to their bias in (c) multiple different languages. To that end, we make
use of the English StereoSet data set (Nadeem et al., 2021), which we
semi-automatically translate into German, French, Spanish, and Turkish. We find
that it is of major importance to conduct this type of analysis in a
multilingual setting, as our experiments show a much more nuanced picture as
well as notable differences from the English-only analysis. The main takeaways
from our analysis are that mGPT-2 (partly) shows surprising anti-stereotypical
behavior across languages, English (monolingual) models exhibit the strongest
bias, and the stereotypes reflected in the data set are least present in
Turkish models. Finally, we release our codebase alongside the translated data
sets and practical guidelines for the semi-automatic translation to encourage a
further extension of our work to other languages.
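
For readers unfamiliar with StereoSet, the sketch below shows how its summary metric is usually computed, assuming the definitions from Nadeem et al. (2021): a language modeling score (lms) and a stereotype score (ss), both on a 0 to 100 scale, are combined into the Idealized CAT score. The function names are illustrative, and the abstract does not state which of these metrics the authors report.

```python
def icat(lms: float, ss: float) -> float:
    """Idealized CAT score, assuming the StereoSet definition.

    lms: language modeling score in [0, 100] (higher is better).
    ss:  stereotype score in [0, 100]; 50 means no preference, values above
         50 lean stereotypical, values below 50 lean anti-stereotypical.
    """
    if not (0.0 <= lms <= 100.0 and 0.0 <= ss <= 100.0):
        raise ValueError("lms and ss must lie in [0, 100]")
    return lms * min(ss, 100.0 - ss) / 50.0


def leaning(ss: float) -> str:
    """Describe which side of the neutral value 50 a stereotype score falls on."""
    if ss > 50.0:
        return "stereotypical"
    if ss < 50.0:
        return "anti-stereotypical"
    return "neutral"


# An ideal model (lms = 100, ss = 50) reaches the maximum score of 100.
print(icat(100.0, 50.0), leaning(50.0))
# A fully stereotyped model scores 0 regardless of its lms.
print(icat(80.0, 100.0), leaning(100.0))
```

Because the score uses min(ss, 100 - ss), it penalizes deviation from 50 in either direction, so anti-stereotypical behavior lowers it just as stereotypical behavior does.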

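This work extends research on stereotypical bias in pre-trained English language models by systematically analyzing bias across different languages, mono- and multilingual models, and different architectures. It finds that conducting the analysis in a multilingual setting is of major importance, and it releases the codebase, the translated datasets, and practical translation guidelines to encourage extension of the work to other languages.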

How Different Is Stereotypical Bias Across Languages?

The representations in large language models contain multiple types of gender
information. We focus on two types of such signals in English texts: factual
gender information, which is a grammatical or semantic property, and gender
bias, which is the correlation between a word and a specific gender. We can
disentangle the model's embeddings and identify components encoding both types
of information with probing. We aim to diminish the stereotypical bias in the
representations while preserving the factual gender signal. Our filtering
method shows that it is possible to decrease the bias of gender-neutral
profession names without significant deterioration of language modeling
capabilities. The findings can be applied to language generation to mitigate
reliance on stereotypes while preserving gender agreement in coreferences.
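
To illustrate removing a bias signal while keeping a factual one, the sketch below uses plain linear projection. This is a simplified stand-in, not the probing-based filtering method the abstract describes. It assumes two direction vectors are already known, one for gender bias and one for factual gender, and all names and numbers are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def scale(u: Vector, c: float) -> Vector:
    return [c * a for a in u]


def sub(u: Vector, v: Vector) -> Vector:
    return [a - b for a, b in zip(u, v)]


def normalize(u: Vector) -> Vector:
    norm = math.sqrt(dot(u, u))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return scale(u, 1.0 / norm)


def filter_bias(x: Vector, bias_dir: Vector, factual_dir: Vector) -> Vector:
    """Remove from x the part of the bias direction orthogonal to factual_dir.

    The bias direction is first made orthogonal to the factual direction, so
    the projection of x onto factual_dir is left unchanged. Any bias that is
    collinear with the factual direction is therefore not removed.
    """
    f_hat = normalize(factual_dir)
    b_perp = normalize(sub(bias_dir, scale(f_hat, dot(bias_dir, f_hat))))
    return sub(x, scale(b_perp, dot(x, b_perp)))


# Hypothetical 3-dimensional embedding and directions, for illustration only.
factual_dir = [1.0, 0.0, 0.0]
bias_dir = [1.0, 1.0, 0.0]
embedding = [0.5, 2.0, 3.0]
filtered = filter_bias(embedding, bias_dir, factual_dir)
print(filtered)
```

Orthogonalizing the bias direction first is the design choice that protects the factual component; projecting out the raw bias direction would also erase part of it whenever the two directions overlap.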

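This study examines gender signals in large language models, focusing on two types of signal in English text: factual gender information and gender bias. It aims to reduce stereotypical bias while preserving the factual gender signal. The filtering method can decrease the bias of gender-neutral profession names without significantly degrading language modeling capabilities. These findings can be applied to language generation to reduce reliance on stereotypes while preserving gender agreement in coreferences.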