Symbol Guided Hindsight Priors for Reward Learning from Human Preferences

October 2022

Symbol Guided Hindsight Priors for Reward Learning from Human Preferences

Mudit Verma, Katherine Metcalf

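TL;DR: This paper studies reward function learning in reinforcement learning. It proposes the PRIOR framework, which uses prior knowledge together with preference data to constrain the reward function, and which can reduce the amount of feedback required by 50% while improving both reward learning and agent performance.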

Abstract

Specifying rewards for reinforcement learned (RL) agents is challenging. Preference-based RL (PbRL) mitigates these challenges by inferring a reward from feedback over sets of trajectories. However, the effectiveness of PbRL is limited by the amount of feedback needed to reliably recover the structure of the target reward. We present the PRIor Over Rewards (