BriefGPT.xyz
Oct, 2018
打破视野的诅咒:无穷视野离线估计
Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation
HTML
PDF
Qiang Liu, Lihong Li, Ziyang Tang, Dengyong Zhou
TL;DR
本文提出了一种新的离线策略估计方法,其中将重要性采样直接应用于平稳态访问分布,从而避免了现有估计器所面临的方差爆炸问题。通过仅从行为分布中采样轨迹,我们开发了一种估计密度比的新方法,并为估算问题设计了mini-max损失函数,并推导出了RKHS情况下的封闭形式解决方案。
Abstract
We consider the
off-policy estimation
problem of estimating the expected reward of a target policy using samples collected by a different behavior policy.
importance sampling
(IS) has been a key technique to deri
→