BriefGPT.xyz
Jan, 2024
Trajectory-Oriented Policy Optimization with Sparse Rewards
Guojian Wang, Faguo Wu, Xiao Zhang
TL;DR
A reinforcement learning method that leverages offline demonstration trajectories: it measures the distance between trajectories with Maximum Mean Discrepancy (MMD) and casts policy optimization as a distance-constrained optimization problem. The shaped reward function learned from the offline demonstrations drives the policy's state-action visitation marginal distribution to match that of the demonstrations, yielding faster and more efficient online reinforcement learning in sparse-reward settings.
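The core quantity in the summary above is an MMD distance between the agent's trajectories and the demonstrations. As a minimal sketch (not the paper's implementation), squared MMD with an RBF kernel over state-action samples can be computed as follows; the kernel bandwidth `sigma` and the sample shapes are illustrative assumptions:

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    # Gaussian (RBF) kernel matrix between two batches of vectors.
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def mmd2(X, Y, sigma=1.0):
    """Squared Maximum Mean Discrepancy between two sample sets.

    X, Y: arrays of shape (n, d) and (m, d), e.g. state-action pairs
    drawn from the agent's trajectories and from the demonstrations.
    """
    kxx = rbf_kernel(X, X, sigma).mean()
    kyy = rbf_kernel(Y, Y, sigma).mean()
    kxy = rbf_kernel(X, Y, sigma).mean()
    return kxx + kyy - 2.0 * kxy

# Matching distributions give (near-)zero MMD; distant ones give a larger value.
rng = np.random.default_rng(0)
agent = rng.normal(0.0, 1.0, size=(128, 4))   # hypothetical agent samples
demo_near = agent.copy()                      # demonstrations identical to agent
demo_far = rng.normal(5.0, 1.0, size=(128, 4))  # demonstrations far from agent
print(mmd2(agent, demo_near))  # ~0
print(mmd2(agent, demo_far))
```

A distance-constrained policy update would then penalize (or constrain) `mmd2` between the policy's visitation samples and the demonstration samples, which is the role the shaped reward plays in the summary.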
Abstract
Deep reinforcement learning (DRL) remains challenging in tasks with sparse rewards. These sparse rewards often only indicate whether the t…