Moral foundations theory (MFT) is a psychological assessment tool that
decomposes human moral reasoning into five factors, including care/harm,
liberty/oppression, and sanctity/degradation (Graham et al., 2009). People vary
in the weight they place on these dimensions when making moral decisions, in
part due to their cultural upbringing and political ideology. As large language
models (LLMs) are trained on datasets collected from the internet, they may
reflect the biases that are present in such corpora. This paper uses MFT as a
lens to analyze whether popular LLMs have acquired a bias towards a particular
set of moral values. We analyze known LLMs and find they exhibit particular
moral foundations, and show how these relate to human moral foundations and
political affiliations. We also measure the consistency of these biases, that
is, whether they vary strongly with the context in which the model is
prompted. Finally, we show that we can adversarially select prompts that
encourage the model to exhibit a particular set of moral foundations, and that
this can affect the model's behavior on downstream tasks. These findings help
illustrate the potential risks and unintended consequences of LLMs assuming a
particular moral stance.

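Summary: Using moral foundations theory, this paper analyzes known large language models and finds that they exhibit particular moral biases, and it shows how these biases relate to human moral foundations and political affiliations. It also measures the consistency of these biases and demonstrates that selectively chosen prompt contexts can steer a model's behavior on downstream tasks, illustrating the potential risks and unintended consequences of LLMs assuming a particular moral stance.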

Moral Foundations of Large Language Models

With the growing capabilities and pervasiveness of AI systems, societies must
collectively choose between two paths: reduced human autonomy, endangered
democracies and limited human rights on the one hand, or AI that is aligned
with human and social values, nurturing collaboration, resilience, knowledge
and ethical behaviour on the other. In this
chapter, we introduce the notion of self-reflective AI systems for meaningful
human control over AI systems. Focusing on decision support systems, we propose
a framework that integrates knowledge from psychology and philosophy with
formal reasoning methods and machine learning approaches to create AI systems
responsive to human values and social norms. We also propose a possible
research approach to design and develop self-reflective capability in AI
systems. Finally, we argue that self-reflective AI systems can lead to
self-reflective hybrid systems (human + AI), thus increasing meaningful human
control and empowering human moral reasoning by providing comprehensible
information and insights on possible human moral blind spots.

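Summary: This chapter introduces the notion of self-reflective AI systems and proposes a framework that integrates psychology, philosophy, formal reasoning methods, and machine learning approaches to create AI systems responsive to human values and social norms. Such systems can increase meaningful human control and empower human moral reasoning by providing comprehensible information and insights on possible human moral blind spots.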