Preference-based Reinforcement Learning (PbRL) removes the need to hand-specify a reward function by learning a reward from preference feedback over policy behaviors. Current approaches to PbRL do not address the credit assignment problem inherent in determining which parts of a behavior contributed most to a preference, which results in data-intensive approaches and subpar reward functions. We address these limitations by introducing a credit assignment strategy (Hindsight PRIOR) that uses a world model to approximate state importance within a trajectory and then guides rewards to be proportional to state importance through an auxiliary predicted return redistribution objective. Incorporating state importance into reward learning improves the speed of policy learning, overall policy performance, and reward recovery on both locomotion and manipulation tasks. For example, Hindsight PRIOR recovers on average significantly (p<0.05) more reward on MetaWorld (20%) and DMC (15%). The performance gains and our ablations demonstrate the benefits that even a simple credit assignment strategy can have on reward learning, and that state importance in forward dynamics prediction is a strong proxy for a state's contribution to a preference decision. The code repository is available at https://github.com/apple/ml-rlhf-hindsight-prior.
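
The abstract does not state the form of the auxiliary objective, so the following is only a minimal sketch of one way "rewards proportional to state importance" could be written as a loss. It assumes a Bradley-Terry preference loss over segment returns and a squared-error term that pulls each predicted reward toward its importance-weighted share of the predicted return. The function names, the weight `lam`, and the idea that importance scores arrive as a plain array are all illustrative assumptions, not the released implementation.

```python
import numpy as np

def preference_loss(r_a, r_b, label):
    """Bradley-Terry cross-entropy on segment returns (assumed form).

    label = 1.0 means segment A is preferred, 0.0 means segment B.
    """
    logit = r_a.sum() - r_b.sum()
    # -log(sigmoid(x)) = log(1 + exp(-x)), computed stably with logaddexp
    return label * np.logaddexp(0.0, -logit) + (1.0 - label) * np.logaddexp(0.0, logit)

def redistribution_loss(r_hat, importance):
    """Squared error between predicted rewards and the predicted return
    redistributed over states in proportion to their importance (assumed form)."""
    weights = importance / importance.sum()   # normalize importance to sum to 1
    target = weights * r_hat.sum()            # each state's share of the predicted return
    return np.mean((r_hat - target) ** 2)

def total_loss(r_a, r_b, imp_a, imp_b, label, lam=1.0):
    """Preference loss plus the auxiliary term; lam is a hypothetical weight."""
    aux = redistribution_loss(r_a, imp_a) + redistribution_loss(r_b, imp_b)
    return preference_loss(r_a, r_b, label) + lam * aux
```

In this sketch the auxiliary term is zero exactly when the predicted rewards are already proportional to the importance scores, so it only constrains how the predicted return is spread across states, while the preference term decides which segment should have the larger return.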

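By introducing a credit assignment strategy (Hindsight PRIOR) that incorporates state importance into reward learning, the method improves policy learning speed, overall performance, and reward recovery, recovering on average more reward on locomotion and manipulation tasks in MetaWorld (20%) and DMC (15%). This shows that credit assignment strategies greatly benefit reward learning and that state importance in forward dynamics prediction is a strong indicator of a state's contribution to a preference decision.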

Hindsight PRIORs for Reward Learning from Human Preferences