TL;DR: Studies the problem of off-policy value evaluation in reinforcement learning and proposes a way to apply the doubly robust estimator to sequential decision problems. The resulting estimator is guaranteed to be unbiased and has low variance, achieves high accuracy on several benchmark problems, and can serve as a subroutine for safe policy improvement.
Abstract
We study the problem of evaluating a policy that is different from the one that generates the data. Such a problem, known as off-policy evaluation in reinforcement learning (RL), is encountered whenever one wants to estimate the value of a new solution, based on historical data, before actually deploying it.
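To make the idea concrete, below is a minimal sketch of a step-wise doubly robust recursion for the sequential setting, in the spirit of the estimator the paper describes. The function names, signatures, and the assumption of a discrete action set are illustrative choices here, not the paper's reference code:

```python
import numpy as np

def dr_value(traj, pi_e, pi_b, q_hat, actions, gamma=1.0):
    """Doubly robust value estimate for one trajectory [(s, a, r), ...].

    pi_e(s, a), pi_b(s, a): action probabilities under the target (evaluation)
    and behavior policies; q_hat(s, a): an approximate action-value function
    for the target policy (the "model" part of the estimator).
    """
    v_dr = 0.0
    # Backward recursion: V_DR <- V_hat(s_t) + rho_t * (r_t + gamma*V_DR - Q_hat(s_t, a_t)).
    # With correct importance weights (known behavior policy), the estimate is
    # unbiased regardless of q_hat; a good q_hat lowers the variance of the
    # importance-weighted term -- hence "doubly robust".
    for s, a, r in reversed(traj):
        rho = pi_e(s, a) / pi_b(s, a)                           # per-step importance ratio
        v_hat = sum(pi_e(s, b) * q_hat(s, b) for b in actions)  # V_hat(s) = E_{a~pi_e}[Q_hat(s, a)]
        v_dr = v_hat + rho * (r + gamma * v_dr - q_hat(s, a))
    return v_dr

def dr_estimate(trajectories, pi_e, pi_b, q_hat, actions, gamma=1.0):
    """Average the per-trajectory DR estimates over a batch of logged data."""
    return float(np.mean([dr_value(t, pi_e, pi_b, q_hat, actions, gamma)
                          for t in trajectories]))
```

Compared with plain importance sampling, the only extra ingredient is `q_hat`; setting it to zero everywhere recovers the standard per-decision importance sampling estimator.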