Doubly Robust Bias Reduction in Infinite Horizon Off-Policy Estimation

Oct, 2019

Ziyang Tang, Yihao Feng, Lihong Li, Dengyong Zhou, Qiang Liu

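TL;DR: This paper proposes an unbiased augmentation method based on a learned value function. It can reduce the variance of typical importance sampling (IS) estimators, removes the potentially high bias introduced by errors in density ratio estimation, and is proven to be doubly robust.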

Abstract

Infinite horizon off-policy policy evaluation is a highly challenging task due to the excessively large variance of typical importance sampling (IS) estimators. Recently, Liu et al. (2018a) proposed an approach that significantly reduces the variance of infinite-horizon →