Preference-based reinforcement learning (PbRL) aligns a robot's behavior with human preferences via a reward function learned from binary feedback over agent behaviors. We show that dynamics-aware reward functions improve the sample efficiency of PbRL by an order of magnitude. In our experiments we iterate between (1) learning a dynamics-aware state-action representation $z^{sa}$ via a self-supervised temporal consistency task, and (2) bootstrapping the preference-based reward function from $z^{sa}$, which results in faster policy learning and better final policy performance. For example, on quadruped-walk, walker-walk, and cheetah-run, with 50 preference labels we achieve the same performance as existing approaches with 500 preference labels, and we recover 83\% and 66\% of ground truth reward policy performance versus only 38\% and 21\% for existing approaches. The performance gains demonstrate the benefits of explicitly learning a dynamics-aware reward model. Repo: \texttt{https://github.com/apple/ml-reed}.
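
The two objectives that the abstract alternates between can be illustrated with a minimal sketch. It is not the implementation in the linked repository. It assumes a Bradley-Terry preference model, a common choice in PbRL that the abstract does not name, for the binary feedback, and negative cosine similarity as one possible temporal consistency objective. All function names are hypothetical.

```python
import math


def preference_probability(rewards_a, rewards_b):
    """Probability that segment A is preferred over segment B.

    Assumes a Bradley-Terry model over summed predicted rewards;
    the abstract does not specify the preference model.
    """
    diff = sum(rewards_a) - sum(rewards_b)
    # Numerically stable sigmoid of the return difference.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def preference_loss(rewards_a, rewards_b, label):
    """Binary cross-entropy on one preference label.

    label is 1.0 if the annotator preferred A and 0.0 if B.
    """
    p = preference_probability(rewards_a, rewards_b)
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(p + eps) + (1.0 - label) * math.log(1.0 - p + eps))


def temporal_consistency_loss(predicted_next, target_next):
    """Negative cosine similarity between two embedding vectors.

    predicted_next: next-state embedding predicted from a state-action
    representation; target_next: embedding of the observed next state.
    This objective is an assumption made for illustration.
    """
    dot = sum(p * t for p, t in zip(predicted_next, target_next))
    norm_p = math.sqrt(sum(p * p for p in predicted_next))
    norm_t = math.sqrt(sum(t * t for t in target_next))
    return -dot / (norm_p * norm_t + 1e-12)
```

In this sketch the per-step rewards passed to `preference_loss` would come from a reward head built on top of the representation trained with `temporal_consistency_loss`, so preference labels only have to fit the reward head rather than the whole network.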


Sample-Efficient Preference-based Reinforcement Learning with Dynamics Aware Rewards