Preference-based reinforcement learning (PbRL) has shown impressive
capabilities in training agents without reward engineering. However, a notable
limitation of PbRL is its dependency on substantial human feedback. This
dependency stems from the learning loop, which entails accurate reward learning
compounded with value/policy learning, necessitating a considerable number of
samples. To boost the learning loop, we propose SEER, an efficient PbRL method
that integrates label smoothing and policy regularization techniques. Label
smoothing reduces overfitting of the reward model by smoothing human preference
labels. Additionally, we bootstrap a conservative estimate $\widehat{Q}$ using
well-supported state-action pairs from the current replay memory to mitigate
overestimation bias and utilize it for policy learning regularization. Our
experimental results across a variety of complex tasks, in both online and
offline settings, demonstrate that our approach improves feedback efficiency,
outperforming state-of-the-art methods by a large margin. Ablation studies
further reveal that SEER achieves a more accurate Q-function compared to prior
work.
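
As a rough illustration of the label smoothing idea described above, the sketch below softens a hard preference label before computing a cross-entropy loss against a Bradley-Terry preference probability. This is a minimal sketch under assumptions, not SEER's implementation: the function names, the smoothing strength `eps`, and the Bradley-Terry form of the reward model's prediction are all hypothetical choices that the abstract does not specify.

```python
import math


def smooth_label(y, eps=0.1):
    """Pull a hard preference label y in {0, 1} toward 0.5 by a factor eps."""
    return (1.0 - eps) * y + eps * 0.5


def preference_loss(ret_a, ret_b, y, eps=0.1):
    """Cross-entropy between a smoothed label and a Bradley-Terry probability.

    ret_a, ret_b: summed predicted rewards of segments A and B (assumed inputs).
    y: 1.0 if the human preferred segment A, 0.0 if segment B.
    Assumes moderate return differences; math.exp overflows for very large gaps.
    """
    p_a = 1.0 / (1.0 + math.exp(ret_b - ret_a))  # P(A preferred over B)
    p_a = min(max(p_a, 1e-12), 1.0 - 1e-12)      # keep both logs finite
    target = smooth_label(y, eps)
    return -(target * math.log(p_a) + (1.0 - target) * math.log(1.0 - p_a))
```

With `eps=0.0` this reduces to the usual preference cross-entropy. With `eps > 0`, a reward model that becomes extremely confident in one segment is penalized, which is the overfitting-reduction effect the abstract attributes to label smoothing.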

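SEER, a PbRL method, integrates label smoothing and policy regularization to improve feedback efficiency and achieves a significant performance advantage.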

Efficient Preference-based Reinforcement Learning via Aligned Experience Estimation

Interactive reinforcement learning has shown promise in learning complex
robotic tasks. However, the process can be human-intensive due to the
requirement of a large amount of interactive feedback. This paper presents a new
method that uses scores provided by humans, instead of pairwise preferences, to
improve the feedback efficiency of interactive reinforcement learning. Our key
insight is that scores can yield significantly more data than pairwise
preferences. Specifically, we require a teacher to interactively score the full
trajectories of an agent to train a behavioral policy in a sparse reward
environment. To prevent unstable scores given by humans from negatively
impacting the training process, we propose an adaptive learning scheme. This enables the
learning paradigm to be insensitive to imperfect or unreliable scores. We
extensively evaluate our method on robotic locomotion and manipulation tasks.
The results show that the proposed method can efficiently learn near-optimal
policies by adaptive learning from scores, while requiring less feedback
compared to pairwise preference learning methods. The source code is publicly
available at this https URL
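
One way to read the claim that scores yield more data than pairwise preferences is a counting argument: n scored trajectories imply up to n(n-1)/2 ordered comparisons, whereas n pairwise queries give only n. The sketch below illustrates that arithmetic only. It is an assumption about how to interpret the claim, not the paper's training procedure; the helper name `pairs_from_scores` and the tie threshold `min_gap` are hypothetical.

```python
from itertools import combinations


def pairs_from_scores(scores, min_gap=0.0):
    """Derive pairwise preference labels from per-trajectory scores.

    Returns (i, j, label) tuples, where label is 1.0 if trajectory i scored
    higher than trajectory j, 0.0 if lower, and 0.5 if the two scores differ
    by at most min_gap (treated as a tie).
    """
    pairs = []
    for i, j in combinations(range(len(scores)), 2):
        gap = scores[i] - scores[j]
        if abs(gap) <= min_gap:
            label = 0.5
        elif gap > 0:
            label = 1.0
        else:
            label = 0.0
        pairs.append((i, j, label))
    return pairs
```

For example, four scored trajectories produce six implied comparisons, so the number of comparisons grows quadratically in the number of scores a teacher provides.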

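This paper proposes a new method that uses human-provided scores instead of pairwise preferences to improve feedback efficiency in interactive reinforcement learning. The method is extensively evaluated on robotic locomotion and manipulation tasks, and the results show that it can efficiently learn near-optimal policies by adaptive learning from scores while requiring less feedback than pairwise preference learning methods.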

Boosting Feedback Efficiency of Interactive Reinforcement Learning by Adaptive Learning from Scores

Preference-based reinforcement learning (PbRL) provides a natural way to
align RL agents' behavior with human desired outcomes, but is often restrained
by costly human feedback. To improve feedback efficiency, most existing PbRL
methods focus on selecting queries to maximally improve the overall quality of
the reward model, but counter-intuitively, we find that this may not
necessarily lead to improved performance. To unravel this mystery, we identify
a long-neglected issue in the query selection schemes of existing PbRL studies:
Query-Policy Misalignment. We show that the seemingly informative queries
selected to improve the overall quality of the reward model actually may not align
with RL agents' interests, thus offering little help for policy learning and
eventually resulting in poor feedback efficiency. We show that this issue can
be effectively addressed via near on-policy query and a specially designed
hybrid experience replay, which together enforce the bidirectional query-policy
alignment. Simple yet elegant, our method can be easily incorporated into
existing approaches by changing only a few lines of code. We showcase in
comprehensive experiments that our method achieves substantial gains in both
human feedback and RL sample efficiency, demonstrating the importance of
addressing query-policy misalignment in PbRL tasks.
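
The two ingredients named above can be sketched roughly as follows, assuming a replay buffer stored as a list ordered from oldest to newest. The abstract does not give the details, so everything here is a hypothetical illustration: "near on-policy" is taken to mean drawing query candidates from the most recent `recent_n` entries, and "hybrid" replay is taken to mean mixing recent and uniformly sampled entries in a fixed fraction `recent_frac`.

```python
import random


def sample_query(buffer, recent_n, rng):
    """Pick two distinct query candidates from the newest recent_n entries.

    Assumes buffer is ordered oldest to newest and recent_n >= 2.
    """
    recent = buffer[-recent_n:]
    return tuple(rng.sample(recent, 2))


def hybrid_batch(buffer, recent_n, batch_size, recent_frac, rng):
    """Build a training batch that mixes recent and uniformly sampled entries.

    The first int(batch_size * recent_frac) items come from the newest
    recent_n entries; the rest are drawn uniformly from the whole buffer.
    """
    n_recent = int(batch_size * recent_frac)
    recent = buffer[-recent_n:]
    batch = [rng.choice(recent) for _ in range(n_recent)]
    batch += [rng.choice(buffer) for _ in range(batch_size - n_recent)]
    return batch
```

Restricting both the queries and part of each batch to recent data is what keeps the labeled comparisons and the policy updates focused on the same region of behavior, which is the alignment the abstract argues for.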

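This paper introduces a method that improves human feedback efficiency by changing the query selection scheme so that queries align with the policy, and shows in comprehensive experiments that the method yields substantial gains in both human feedback efficiency and RL sample efficiency.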