BriefGPT.xyz
Nov, 2023
Projected Off-Policy Q-Learning (POP-QL) for Stabilizing Offline Reinforcement Learning
Melrose Roderick, Gaurav Manek, Felix Berkenkamp, J. Zico Kolter
TL;DR
A new method for stabilizing off-policy Q-learning that reweights offline samples and constrains the policy to prevent divergence and reduce value-approximation error; it performs competitively on standard benchmarks and outperforms competing methods on tasks where the data-collection policy is significantly sub-optimal.
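A minimal sketch of the reweighting idea described in the TL;DR: a Bellman loss in which each offline transition carries a corrective weight. The function names and the PyTorch setting are illustrative assumptions, not the paper's actual POP-QL implementation, and the weights are taken as given rather than computed by the paper's projection step.

import torch

def weighted_bellman_loss(q_net, target_q_net, batch, weights, gamma=0.99):
    # TD loss where each offline transition is reweighted by `weights`
    # (a stand-in for the distribution-correcting weights a method like
    # POP-QL would compute; here they are simply given).
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Bootstrap target uses the greedy action under the target network.
        next_q = target_q_net(s_next).max(dim=1).values
        target = r + gamma * (1.0 - done) * next_q
    td_error = q_sa - target
    # Reweighting down-weights transitions that are over-represented in the
    # offline dataset relative to the learned policy's visitation distribution.
    return (weights * td_error.pow(2)).mean()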
Abstract
A key problem in off-policy reinforcement learning (RL) is the mismatch, or distribution shift, between the dataset and the distribution over states and actions visited by the learned policy. This problem is exacerbated …
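To make the mismatch concrete, the following toy calculation (illustrative only, not from the paper) counts how often bootstrapped Bellman targets query state-action pairs that the offline dataset never covers; the tabular setting and all numbers are assumptions chosen for the example.

import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 500, 4

# Offline dataset collected by a behavior policy that heavily favors action 0.
behavior_probs = np.array([0.85, 0.05, 0.05, 0.05])
states = rng.integers(n_states, size=2000)
actions = rng.choice(n_actions, size=2000, p=behavior_probs)
covered = set(zip(states.tolist(), actions.tolist()))

# A learned policy that prefers a non-behavior action in every state.
learned_policy = rng.integers(1, n_actions, size=n_states)

# Fraction of bootstrap queries (s', pi(s')) that fall outside dataset support:
# these are the values the Bellman target must extrapolate.
query_states = rng.integers(n_states, size=1000)
queries = [(int(s), int(learned_policy[s])) for s in query_states]
out_of_dist = np.mean([q not in covered for q in queries])
print(f"Bellman targets outside dataset support: {out_of_dist:.1%}")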