TL;DR: Uses Expectation Alignment (EAL), an explanatory framework grounded in Theory of Mind, to understand objective misspecification and its causes, and proposes an interactive algorithm based on the specified reward to infer the user's expectations of system behavior.
Abstract
Detecting and handling misspecified objectives, such as reward functions, has
been widely recognized as one of the central challenges within the domain of
Artificial Intelligence (AI) safety research. However, even with the
recognition of the importance of this problem, we are unaware