Reward design, the problem of selecting an appropriate reward function for an AI system, is both critically important, as it encodes the task the system should perform, and challenging, as it requires reasoning about and understanding the agent's environment in detail. AI practitioners often iterate on the reward function for their systems in a trial-and-error process to get their desired behavior. Inverse reward design (IRD) is a preference inference method that infers a true reward function from an observed, possibly misspecified, proxy reward function. This allows the system to determine when it should trust its observed reward function and respond appropriately. This has been shown to avoid problems in reward design such as negative side-effects (omitting a seemingly irrelevant but important aspect of the task) and reward hacking (learning to exploit unanticipated loopholes). In this paper, we actively select the $\textit{set of proxy reward functions}$ available to the designer. This improves the quality of inference and simplifies the associated reward design problem. We present two types of queries: discrete queries, where the system designer chooses from a discrete set of reward functions, and feature queries, where the system queries the designer for weights on a small set of features. We evaluate this approach with experiments in a personal shopping assistant domain and a 2D navigation domain. We find that our approach leads to reduced regret at test time compared with vanilla IRD. Our results indicate that actively selecting the set of available reward functions is a promising direction to improve the efficiency and effectiveness of reward design.

通过与用户交互，选择最能反映真实回报的问题来迭代AI代理的奖励函数设计，我们的方法优于Inverse Reward Design，且可以推断非线性奖励函数，包括可解释的线性奖励函数。

主动反向奖励设计