Moral foundations theory (MFT) is a psychological assessment tool that decomposes human moral reasoning into five factors, including care/harm, liberty/oppression, and sanctity/degradation (Graham et al., 2009). People vary in the weight they place on these dimensions when making moral decisions, in part due to their cultural upbringing and political ideology. Because large language models (LLMs) are trained on datasets collected from the internet, they may reflect the biases present in such corpora. This paper uses MFT as a lens to analyze whether popular LLMs have acquired a bias towards a particular set of moral values. We analyze known LLMs and find that they exhibit particular moral foundations, and we show how these relate to human moral foundations and political affiliations. We also measure the consistency of these biases, that is, whether they vary strongly depending on the context in which the model is prompted. Finally, we show that we can adversarially select prompts that encourage the model to exhibit a particular set of moral foundations, and that this can affect the model's behavior on downstream tasks. These findings help illustrate the potential risks and unintended consequences of LLMs assuming a particular moral stance.

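This paper uses moral foundations theory to analyze known large language models, finds that they exhibit particular moral biases, and shows how these biases relate to human moral foundations and political leanings. It also measures the consistency of these biases and demonstrates that selectively steering a model through different contexts can affect its behavior on downstream tasks, revealing the potential risks and unintended consequences of large language models adopting a particular moral stance.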

Moral Foundations of Large Language Models