BriefGPT.xyz
Apr, 2024
Hindsight PRIORs for Reward Learning from Human Preferences
Mudit Verma, Katherine Metcalf
TL;DR
Introducing a credit-assignment strategy (Hindsight PRIOR) that incorporates state importance into reward learning improves policy learning speed, overall performance, and reward recovery, recovering on average 20% more reward on MetaWorld and 15% more on DMC locomotion and manipulation tasks. This suggests that credit assignment substantially benefits reward learning, and that state importance derived from forward dynamics prediction is a strong indicator of which states drive preference decisions.
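The core idea in the TL;DR, redistributing a trajectory's learned return across states in proportion to their importance, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the importance weights here are assumed to come from some external source (the paper derives them from a forward-dynamics world model), and the auxiliary loss is a simplified stand-in for the method's prior over per-state rewards.

```python
import numpy as np

def hindsight_prior_targets(importance, trajectory_return):
    """Redistribute a trajectory's return across its states in
    proportion to each state's importance weight (a simplified
    sketch of the credit-assignment prior)."""
    w = np.asarray(importance, dtype=float)
    w = w / w.sum()  # normalize importance into a distribution over states
    return w * trajectory_return

def prior_loss(predicted_rewards, importance, trajectory_return):
    """Auxiliary MSE pushing the reward model's per-state predictions
    toward the importance-weighted redistribution of the return."""
    targets = hindsight_prior_targets(importance, trajectory_return)
    diff = np.asarray(predicted_rewards, dtype=float) - targets
    return float(np.mean(diff ** 2))

# Example: a 2-state trajectory with return 8, where the second state
# is three times as important, yields per-state targets [2, 6].
targets = hindsight_prior_targets([1.0, 3.0], 8.0)
```

In the full method this loss would be added to the standard preference (Bradley-Terry) loss, so the reward model both matches human preferences and concentrates reward on important states.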
Abstract
Preference-based reinforcement learning (PbRL) removes the need to hand-specify a reward function by learning a reward from preference feedback over policy behaviors. Current approaches to PbRL do not address the credit assignment problem …