In the diverse array of work investigating the nature of human values from
psychology, philosophy and social sciences, there is a clear consensus that
values guide behaviour. More recently, a recognition that values provide a
means to engineer ethical AI has emerged. Indeed, Stuart Russell proposed
shifting AI's focus away from simply ``intelligence'' towards intelligence
``provably aligned with human values''. This challenge -- the value alignment
problem -- together with others, including an AI's learning of human values,
aggregating individual values to groups, and designing computational mechanisms
to reason over values, has energised a sustained research effort. Despite this,
no formal, computational definition of values has yet been proposed. We address
this gap through a formal conceptual framework rooted in the social sciences that
provides a foundation for the systematic, integrated and interdisciplinary
investigation into how human values can support designing ethical AI.

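A formal conceptual framework rooted in the social sciences supports the systematic, integrated and interdisciplinary investigation of how human values can support designing ethical AI, addressing the value alignment problem and related challenges such as an AI's learning of human values, aggregating individual values to groups, and designing computational mechanisms to reason over values.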

A computational framework of human values for ethical AI

When AI agents don't align their actions with human values, they may cause
serious harm. One way to solve the value alignment problem is by including a
human operator who monitors all of the agent's actions. Although this solution
guarantees maximal safety, it is very inefficient, since it requires the human
operator to dedicate all of their attention to the agent. In this paper, we
propose a much more efficient solution that allows an operator to be engaged in
other activities without neglecting the monitoring task. In our approach, the AI
agent requests permission from the operator only for critical actions, that is,
potentially harmful actions. We introduce the
concept of critical actions with respect to AI safety and discuss how to build
a model that measures action criticality. We also discuss how the operator's
feedback could be used to make the agent smarter.
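
The permission scheme described above can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `CriticalityGate`, the fixed threshold, the `criticality` scoring function, and the `ask_operator` callback are all hypothetical names introduced here, and the abstract does not say how criticality is actually modelled. The sketch only shows the control flow: low-criticality actions run without interruption, critical ones wait for the operator, and each operator decision is logged so it could later be used as feedback.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class CriticalityGate:
    # Hypothetical criticality model: maps an action to a score in [0, 1].
    criticality: Callable[[str], float]
    # Hypothetical operator channel: returns True if the action is approved.
    ask_operator: Callable[[str], bool]
    # Assumed cut-off; actions scoring at or above it count as critical.
    threshold: float = 0.5
    # Operator decisions, kept as possible training signal for the agent.
    feedback: List[Tuple[str, bool]] = field(default_factory=list)

    def allow(self, action: str) -> bool:
        """Return True if the agent may execute the action."""
        if self.criticality(action) < self.threshold:
            return True  # non-critical: no operator attention needed
        approved = self.ask_operator(action)
        self.feedback.append((action, approved))
        return approved


# Toy usage with made-up scores; unknown actions are treated as critical.
scores = {"move_forward": 0.1, "delete_files": 0.9}
gate = CriticalityGate(
    criticality=lambda a: scores.get(a, 1.0),
    ask_operator=lambda a: False,  # stand-in operator that denies everything
)
print(gate.allow("move_forward"), gate.allow("delete_files"))
```

Treating unknown actions as maximally critical is a design choice of this sketch: it errs toward asking the operator rather than acting unsupervised.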

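This paper proposes a more efficient solution to the value alignment problem in AI safety: a criticality model measures how critical each action is, and the operator needs to intervene only for critical actions, so safety is maintained even while the operator attends to other work.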

The Concept of Criticality in AI Safety

For an autonomous system to be helpful to humans and to pose no unwarranted
risks, it needs to align its values with those of the humans in its environment
in such a way that its actions contribute to the maximization of value for the
humans. We propose a formal definition of the value alignment problem as
cooperative inverse reinforcement learning (CIRL). A CIRL problem is a
cooperative, partial-information game with two agents, human and robot; both
are rewarded according to the human's reward function, but the robot does not
initially know what this is. In contrast to classical IRL, where the human is
assumed to act optimally in isolation, optimal CIRL solutions produce behaviors
such as active teaching, active learning, and communicative actions that are
more effective in achieving value alignment. We show that computing optimal
joint policies in CIRL games can be reduced to solving a POMDP, prove that
optimality in isolation is suboptimal in CIRL, and derive an approximate CIRL
algorithm.
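
One ingredient of the setting above, the robot's uncertainty about the human's reward function, can be sketched as a Bayesian belief update over a discrete reward parameter. This is an illustration under stated assumptions, not the paper's algorithm: the function `update_belief`, the parameter names, and the Boltzmann-rational human model with rationality coefficient `beta` are choices made here for the example. The Boltzmann model in particular is a common IRL-style simplification, whereas the abstract notes that optimal CIRL humans need not act optimally in isolation.

```python
import math
from typing import Dict


def update_belief(
    belief: Dict[str, float],
    rewards: Dict[str, Dict[str, float]],
    observed_action: str,
    beta: float = 1.0,
) -> Dict[str, float]:
    """Bayes update of the robot's belief over the human's reward parameter.

    belief:  reward parameter theta -> prior probability
    rewards: theta -> {action: reward of that action under theta}
    beta:    assumed rationality of the human (higher means closer to optimal)
    """
    posterior = {}
    for theta, prior in belief.items():
        action_rewards = rewards[theta]
        # Assumed human model: P(action | theta) is proportional to exp(beta * reward).
        normaliser = sum(math.exp(beta * r) for r in action_rewards.values())
        likelihood = math.exp(beta * action_rewards[observed_action]) / normaliser
        posterior[theta] = prior * likelihood
    total = sum(posterior.values())
    return {theta: p / total for theta, p in posterior.items()}


# Toy example: the robot is unsure whether the human prefers coffee or tea.
prior = {"likes_coffee": 0.5, "likes_tea": 0.5}
rewards = {
    "likes_coffee": {"coffee": 1.0, "tea": 0.0},
    "likes_tea": {"coffee": 0.0, "tea": 1.0},
}
posterior = update_belief(prior, rewards, observed_action="coffee")
print(posterior)
```

After observing the human choose coffee, the belief shifts toward `likes_coffee`. In the POMDP view mentioned above, a belief of this kind is the state the robot plans over.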

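This paper proposes a formal definition of the value alignment problem based on cooperative inverse reinforcement learning (CIRL), in which a robot and a human are two agents whose shared goal is to maximize the human's reward function. The problem can be reduced to a POMDP, and we also propose an approximate CIRL algorithm.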