Reinforcement learning (RL) algorithms depend on carefully engineered reward functions to guide learning agents toward the required tasks. Preference-based reinforcement learning (PbRL) addresses this dependency by using human preferences as expert feedback in place of numeric rewards. Because of this advantage over traditional RL, PbRL has attracted growing attention in recent years and has seen many significant advances. In this survey, we present a unified PbRL framework that encompasses newly emerging approaches for improving the scalability and efficiency of PbRL. In addition, we give a detailed overview of the theoretical guarantees and benchmarking work in the field and present its recent applications to complex real-world tasks. Lastly, we discuss the limitations of current approaches and proposed directions for future research.

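This work addresses reinforcement learning's dependence on accurately designed reward functions by using human preferences as feedback, which improves learning efficiency. The paper proposes a unified preference-based reinforcement learning framework, examines theoretical guarantees and practical applications in detail, and identifies the limitations of current research along with directions for future work. The work helps advance the application and development of preference-based reinforcement learning in complex tasks.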

Advances in Preference-based Reinforcement Learning: A Review