BriefGPT.xyz
Jul, 2024
Hindsight Preference Learning for Offline Preference-based Reinforcement Learning
Chen-Xiao Gao, Shengjun Fang, Chenjun Xiao, Yang Yu, Zongzhang Zhang
TL;DR
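Proposes Hindsight Preference Learning (HPL), a method that models human preferences over trajectory segments in an offline dataset and uses hindsight information to compute a reward for each step, yielding more robust and advantageous rewards.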
Abstract
Offline preference-based reinforcement learning (RL), which focuses on optimizing policies using human preferences between pairs of trajectory segments…
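Preference-based RL commonly turns a label over a pair of trajectory segments into a training signal with the Bradley-Terry model: a learned per-step reward is summed over each segment, and the probability that one segment is preferred is the logistic function of the difference between the two segment returns. The sketch below shows that standard formulation in plain Python. It is the conventional baseline, not HPL's hindsight-based reward model, and the function names (`segment_return`, `preference_prob`, `preference_loss`) are hypothetical.

```python
import math
from typing import Sequence


def segment_return(step_rewards: Sequence[float]) -> float:
    """Sum of per-step rewards r(s_t, a_t) over one trajectory segment."""
    return sum(step_rewards)


def preference_prob(rewards_a: Sequence[float], rewards_b: Sequence[float]) -> float:
    """Bradley-Terry probability that segment A is preferred over segment B.

    Equivalent to exp(R_A) / (exp(R_A) + exp(R_B)) = sigmoid(R_A - R_B).
    """
    diff = segment_return(rewards_a) - segment_return(rewards_b)
    # Numerically stable logistic function: avoid exp() of a large positive number.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    z = math.exp(diff)
    return z / (1.0 + z)


def preference_loss(rewards_a: Sequence[float], rewards_b: Sequence[float],
                    label: float) -> float:
    """Cross-entropy between a preference label and the Bradley-Terry prediction.

    label = 1.0 means A was preferred, 0.0 means B was preferred, 0.5 means a tie.
    """
    p = preference_prob(rewards_a, rewards_b)
    eps = 1e-12  # guard against log(0) when p saturates at 0 or 1
    return -(label * math.log(p + eps) + (1.0 - label) * math.log(1.0 - p + eps))


if __name__ == "__main__":
    seg_a = [0.5, 0.2, 0.3]  # per-step rewards of segment A (return 1.0)
    seg_b = [0.1, 0.1, 0.0]  # per-step rewards of segment B (return 0.2)
    print(round(preference_prob(seg_a, seg_b), 3))  # → 0.69
    print(preference_loss(seg_a, seg_b, 1.0) < preference_loss(seg_a, seg_b, 0.0))  # → True
```

In training, the per-step rewards would come from a parameterized reward model and the loss would be minimized over a dataset of labeled segment pairs; the logistic form means only the difference in segment returns matters, so adding a constant to every step's reward leaves the predicted preference unchanged when both segments have the same length.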