Deep reinforcement learning (DRL) remains challenging in tasks with sparse rewards. These sparse rewards often only indicate whether the task is partially or fully completed, meaning that many exploration actions must be performed before the agent obtains useful feedback. Hence, most existing DRL algorithms fail to learn feasible policies within a reasonable time frame. To overcome this problem, we develop an approach that exploits offline demonstration trajectories for faster and more efficient online RL in sparse reward settings. Our key insight is that by regarding offline demonstration trajectories as guidance, instead of imitating them, our method learns a policy whose state-action visitation marginal distribution matches that of offline demonstrations. Specifically, we introduce a novel trajectory distance based on maximum mean discrepancy (MMD) and formulate policy optimization as a distance-constrained optimization problem. Then, we show that this distance-constrained optimization problem can be reduced into a policy-gradient algorithm with shaped rewards learned from offline demonstrations. The proposed algorithm is evaluated on extensive discrete and continuous control tasks with sparse and deceptive rewards. The experimental results indicate that our proposed algorithm is significantly better than the baseline methods regarding diverse exploration and learning the optimal policy.

利用离线演示轨迹的强化学习方法，通过最大均值差异（MMD）计算轨迹距离并将策略优化视为一种受距离限制的优化问题，从离线演示学习到的形状奖励函数实现了与离线演示相匹配的状态-动作访问边缘分布，从而在稀疏奖励环境下提供了更快且更高效的在线强化学习方法。

基于轨迹的稀疏奖励策略优化