Preference-based reinforcement learning (PbRL) has shown much promise for using binary human feedback on queried trajectory pairs to recover the underlying reward model of the Human in the Loop (HiL). While prior works have attempted to make better use of the queries posed to the human, in this work we make two observations about the unlabeled trajectories collected by the agent and propose two corresponding loss functions. The first ensures that unlabeled trajectories participate in the reward learning process; the second structures the embedding space of the reward model so that it reflects the structure of the state space with respect to action distances. We validate the proposed method on one locomotion domain and one robotic manipulation task and compare it with the state-of-the-art baseline PEBBLE. We further present an ablation of the proposed loss components across both domains and find that not only does each loss component on its own perform better than the baseline, but the synergistic combination of the two yields much better reward recovery and human feedback sample efficiency.

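This paper proposes two loss functions that let unlabeled trajectories participate in the reward learning process and that structure the reward model's embedding space to reflect the structure of the state space with respect to action distances, with the aim of improving sample efficiency and reward recovery. On a robotic manipulation domain, the method outperforms the current state-of-the-art algorithm PEBBLE.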

Exploiting Unlabeled Data for Feedback-Efficient Human Preference-Based Reinforcement Learning