Preference-based Reinforcement Learning (PbRL) avoids the need for reward
engineering by harnessing human preferences as the reward signal. However,
current PbRL algorithms rely excessively on high-quality feedback from domain
experts, which results in a lack of robustness. In this paper, we present RIME,
a robust PbRL algorithm for effective reward learning from noisy preferences.
Our method incorporates a sample selection-based discriminator to dynamically
select denoised preferences for robust training. To mitigate the accumulated
error caused by incorrect selection, we propose to warm start the reward model,
which additionally bridges the performance gap during the transition from
pre-training to online training in PbRL. Our experiments on robotic
manipulation and locomotion tasks demonstrate that RIME significantly enhances
the robustness of the current state-of-the-art PbRL method. Ablation studies
further demonstrate that the warm start is crucial for both robustness and
feedback-efficiency in limited-feedback cases.
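
To make the idea of sample selection concrete, the following is a minimal sketch of one common way to filter noisy preference labels: score each labeled pair by its cross-entropy loss under the current reward model and keep only the low-loss pairs. It is an illustration under stated assumptions, not RIME's actual discriminator. The Bradley-Terry preference model, the fixed `threshold`, the toy segment returns, and all function names are hypothetical choices made for this example.

```python
import math

def preference_prob(return_a, return_b):
    """Bradley-Terry probability that segment A is preferred over segment B,
    given the summed predicted rewards of each segment."""
    return 1.0 / (1.0 + math.exp(return_b - return_a))

def cross_entropy(prob_a, label):
    """Cross-entropy between the predicted preference and a binary label
    (label 1 means A is preferred, label 0 means B is preferred)."""
    eps = 1e-12  # avoids log(0)
    return -(label * math.log(prob_a + eps)
             + (1 - label) * math.log(1.0 - prob_a + eps))

def select_trusted(samples, threshold):
    """Keep samples whose loss under the current reward model is below the
    threshold. The fixed threshold is an assumption made for illustration."""
    trusted = []
    for return_a, return_b, label in samples:
        loss = cross_entropy(preference_prob(return_a, return_b), label)
        if loss < threshold:
            trusted.append((return_a, return_b, label))
    return trusted

# Toy data: (predicted return of A, predicted return of B, human label).
samples = [
    (2.0, 0.0, 1),  # label agrees with the reward model
    (0.0, 2.0, 0),  # label agrees with the reward model
    (3.0, 0.0, 0),  # label contradicts the reward model, likely noisy
]
trusted = select_trusted(samples, threshold=1.0)
```

In this toy run the third pair has a much larger loss than the other two and is the only one filtered out. A selection rule of this kind depends on the reward model being reasonably accurate when filtering begins, which is consistent with the motivation given above for warm starting the reward model.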
