Moral foundations theory (MFT) is a psychological assessment tool that
decomposes human moral reasoning into five factors, including care/harm,
liberty/oppression, and sanctity/degradation (Graham et al., 2009). People vary
in the weight they place on these dimensions when making moral decisions, in
part due to their cultural upbringing and political ideology. As large language
models (LLMs) are trained on datasets collected from the internet, they may
reflect the biases that are present in such corpora. This paper uses MFT as a
lens to analyze whether popular LLMs have acquired a bias towards a particular
set of moral values. We analyze known LLMs and find they exhibit particular
moral foundations, and show how these relate to human moral foundations and
political affiliations. We also measure the consistency of these biases, that
is, whether they vary strongly depending on the context in which the model is
prompted. Finally, we show that we can adversarially select prompts that
encourage the model to exhibit a particular set of moral foundations, and that
this can affect the model's behavior on downstream tasks. These findings help
illustrate the potential risks and unintended consequences of LLMs assuming a
particular moral stance.

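Summary: Using moral foundations theory, this paper analyzes known large
language models and finds that they exhibit particular moral biases, and it
shows how these biases relate to human moral foundations and political
affiliations. The study also measures the consistency of these biases and
demonstrates that selectively chosen prompt contexts can influence a model's
behavior on downstream tasks, which illustrates the potential risks and
unintended consequences of large language models assuming a particular moral
stance.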

Moral Foundations of Large Language Models

Pre-trained multilingual language models (PMLMs) are commonly used when
dealing with data from multiple languages and cross-lingual transfer. However,
PMLMs are trained on varying amounts of data for each language. In practice
this means their performance is often much better on English than many other
languages. We explore to what extent this also applies to moral norms. Do the
models capture moral norms from English and impose them on other languages? Do
the models exhibit random and thus potentially harmful beliefs in certain
languages? Both these issues could negatively impact cross-lingual transfer and
potentially lead to harmful outcomes. In this paper, we (1) apply the
MoralDirection framework to multilingual models, comparing results in German,
Czech, Arabic, Mandarin Chinese, and English, (2) analyse model behaviour on
filtered parallel subtitles corpora, and (3) apply the models to a Moral
Foundations Questionnaire, comparing with human responses from different
countries. Our experiments demonstrate that, indeed, PMLMs encode differing
moral biases, but these do not necessarily correspond to cultural differences
or commonalities in human opinions.

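Summary: This paper examines whether pre-trained multilingual language models
capture moral norms from English and impose them on other languages, and
whether they exhibit random and potentially harmful beliefs in certain
languages. The study applies the MoralDirection framework to multilingual
models, analyzes model behavior on filtered parallel subtitle corpora, and
applies the models to a Moral Foundations Questionnaire, comparing the results
with human responses from different countries. The experiments show that
pre-trained multilingual language models do encode differing moral biases, but
these biases do not necessarily correspond to cultural differences or
commonalities in human opinions.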