Detecting and handling misspecified objectives, such as reward functions, is
widely recognized as one of the central challenges in Artificial Intelligence
(AI) safety research. However, despite the recognized importance of this
problem, we are unaware of any work that attempts to clearly define what
constitutes (a) a misspecified objective and (b) the successful resolution of
such a misspecification. In this work,
we use theory of mind, i.e., the human user's beliefs about the AI agent,
as the basis for a formal explanatory framework called Expectation
Alignment (EAL) for understanding objective misspecification and its causes.
Our EAL framework not only acts as an explanatory framework for existing
works but also provides us with concrete insights into the limitations of
existing methods to handle reward misspecification and novel solution
strategies. We use these insights to propose a new interactive algorithm that
uses the specified reward to infer potential user expectations about the system
behavior. We show how one can efficiently implement this algorithm by mapping
the inference problem into linear programs. We evaluate our method on a set of
standard Markov Decision Process (MDP) benchmarks.
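
As background on how MDP questions can be posed as linear programs, the following is the standard occupancy-measure LP for a discounted MDP. It is a sketch under stated assumptions, not the specific programs used by our algorithm. The symbols are introduced here for illustration only: states $S$, actions $A$, transition function $P$, specified reward $R$, discount factor $\gamma$, initial state distribution $\mu_0$, and occupancy variables $x(s,a)$.

\[
\begin{aligned}
\max_{x} \quad & \sum_{s \in S} \sum_{a \in A} x(s,a)\, R(s,a) \\
\text{s.t.} \quad & \sum_{a \in A} x(s',a) - \gamma \sum_{s \in S} \sum_{a \in A} P(s' \mid s,a)\, x(s,a) = \mu_0(s') \quad \forall s' \in S, \\
& x(s,a) \ge 0 \quad \forall s \in S,\ a \in A.
\end{aligned}
\]

Under this formulation, a hypothetical user expectation such as "state $s_{\mathrm{bad}}$ is never visited" becomes the additional linear constraint $\sum_{a \in A} x(s_{\mathrm{bad}},a) = 0$, so asking whether a candidate expectation is compatible with behavior that is optimal for the specified reward remains a linear program.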

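Uses Expectation Alignment (EAL), an explanatory framework grounded in theory of mind, to understand objective misspecification and its causes, and proposes an interactive algorithm that uses the specified reward to infer user expectations about system behavior.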