Autonomous agents optimize the reward function we give them. What they don't
know is how hard it is for us to design a reward function that actually
captures what we want. When designing the reward, we might think of some
specific training scenarios, and make sure that the reward will lead to the
right behavior in those scenarios. Inevitably, agents encounter new scenarios
(e.g., new types of terrain) where optimizing that same reward may lead to
undesired behavior. Our insight is that reward functions are merely
observations about what the designer actually wants, and that they should be
interpreted in the context in which they were designed. We introduce inverse
reward design (IRD) as the problem of inferring the true objective based on the
designed reward and the training MDP. We present approximate methods for
solving IRD problems and use their solutions to plan risk-averse behavior in
test MDPs. Empirical results suggest that this approach can help alleviate
negative side effects of misspecified reward functions and mitigate reward
hacking.

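Given the difficulty of designing reward functions and the negative effects that can result, this paper introduces a method for inferring the true objective from context and applies it to avoid the risks caused by improperly specified rewards. Empirical results show that the method reduces the negative effects of misspecified reward functions and lowers the likelihood of reward hacking.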