Preference-based reinforcement learning (PbRL) has shown impressive
capabilities in training agents without reward engineering. However, a notable
limitation of PbRL is its dependency on substantial human feedback. This
dependency stems from the learning loop, which entails accurate reward learning
compounded with value/policy learning, necessitating a considerable number of
samples. To boost the learning loop, we propose SEER, an efficient PbRL method
that integrates label smoothing and policy regularization techniques. Label
smoothing reduces overfitting of the reward model by smoothing human preference
labels. Additionally, we bootstrap a conservative estimate $\widehat{Q}$ using
well-supported state-action pairs from the current replay memory to mitigate
overestimation bias and utilize it for policy learning regularization. Our
experimental results across a variety of complex tasks, in both online and
offline settings, demonstrate that our approach improves feedback efficiency,
outperforming state-of-the-art methods by a large margin. Ablation studies
further reveal that SEER achieves a more accurate Q-function compared to prior
work.
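
The label-smoothing idea in the abstract above can be illustrated with a small sketch. It assumes the Bradley-Terry preference model commonly used in PbRL and the standard two-class smoothing rule; the abstract does not give SEER's exact formulation, and the names `smooth_label`, `smoothed_preference_loss`, and the value `eps = 0.1` are hypothetical.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def smooth_label(label: float, eps: float = 0.1) -> float:
    """Pull a hard preference label in {0, 1} toward 0.5.

    Assumption: standard two-class label smoothing, not necessarily
    the exact rule used by SEER.
    """
    return (1.0 - eps) * label + eps / 2.0


def smoothed_preference_loss(ret_a: float, ret_b: float,
                             label: float, eps: float = 0.1) -> float:
    """Cross-entropy between a smoothed label and the Bradley-Terry
    probability that segment A is preferred over segment B.

    ret_a, ret_b: summed predicted rewards of the two segments.
    label: 1.0 if the human preferred A, 0.0 if the human preferred B.
    """
    diff = ret_a - ret_b
    y = smooth_label(label, eps)
    # Equals -[y * log(sigmoid(diff)) + (1 - y) * log(sigmoid(-diff))].
    return y * softplus(-diff) + (1.0 - y) * softplus(diff)
```

With `eps = 0.1`, a label of 1 becomes 0.95, so the loss is minimized when the predicted preference probability is 0.95 rather than 1. This keeps the reward model from pushing the gap between the two segment returns arbitrarily high on a single label.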

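SEER, a PbRL method, improves feedback efficiency by integrating label smoothing and policy regularization, and achieves a substantial performance advantage.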

Efficient Preference-based Reinforcement Learning via Aligned Experience Estimation

Preference-based Reinforcement Learning (PbRL) avoids the need for reward
engineering by harnessing human preferences as the reward signal. However,
current PbRL algorithms over-rely on high-quality feedback from domain
experts, which results in a lack of robustness. In this paper, we present RIME,
a robust PbRL algorithm for effective reward learning from noisy preferences.
Our method incorporates a sample selection-based discriminator to dynamically
filter denoised preferences for robust training. To mitigate the accumulated
error caused by incorrect selection, we propose to warm start the reward model,
which additionally bridges the performance gap during transition from
pre-training to online training in PbRL. Our experiments on robotic
manipulation and locomotion tasks demonstrate that RIME significantly enhances
the robustness of the current state-of-the-art PbRL method. Ablation studies
further demonstrate that the warm start is crucial for both robustness and
feedback efficiency in limited-feedback cases.
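
To make the sample-selection idea above concrete, here is a minimal sketch of one way to filter noisy preferences. It assumes a small-loss heuristic from noisy-label learning: a preference pair is trusted when its cross-entropy under the current reward model is below a fixed threshold. The abstract does not specify the criterion RIME actually uses, and `cross_entropy`, `select_trusted`, and `threshold` are hypothetical names.

```python
import math
from typing import List, Tuple

# (predicted return of segment A, predicted return of segment B,
#  label: 1 if A was preferred, 0 if B was preferred)
Pair = Tuple[float, float, int]


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def cross_entropy(ret_a: float, ret_b: float, label: int) -> float:
    """Bradley-Terry cross-entropy of one labeled preference pair."""
    diff = ret_a - ret_b
    return softplus(-diff) if label == 1 else softplus(diff)


def select_trusted(pairs: List[Pair],
                   threshold: float) -> Tuple[List[Pair], List[Pair]]:
    """Split pairs into (trusted, suspect) by a loss threshold.

    Assumption: a fixed threshold stands in for whatever dynamic rule
    the discriminator applies.
    """
    trusted: List[Pair] = []
    suspect: List[Pair] = []
    for pair in pairs:
        if cross_entropy(*pair) <= threshold:
            trusted.append(pair)
        else:
            suspect.append(pair)
    return trusted, suspect
```

A pair whose label strongly contradicts the reward model's prediction has a large loss and lands in the suspect list. This also shows why a poorly initialized reward model can cause selection errors that accumulate, which is the problem the warm start is meant to address.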
