Off-policy evaluation (OPE) in reinforcement learning allows one to evaluate novel decision policies without needing to conduct exploration, which is often costly or otherwise infeasible. We consider for the first time the semiparametric efficiency limits of OPE in Markov decision processes (MDPs), where actions, rewards, and states are memoryless. We show existing OPE estimators may fail to be efficient in this setting. We develop a new estimator based on cross-fold estimation of $q$-functions and marginalized density ratios, which we term double reinforcement learning (DRL). We show that DRL is efficient when both components are estimated at fourth-root rates and is also doubly robust when only one component is consistent. We investigate these properties empirically and demonstrate the performance benefits due to harnessing memorylessness efficiently.
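To make the combination of $q$-functions and marginalized density ratios concrete, the following is a minimal sketch of the final averaging step of a doubly robust estimate. It assumes one common form, $\frac{1}{n}\sum_i \sum_t \gamma^t \left( \hat\mu_t (r_t - \hat q_t) + \hat\mu_{t-1} \hat v_t \right)$ with $\hat\mu_{-1} = 1$, which is not stated in the text above. The function name `drl_estimate`, the array layout, and the discount `gamma` are hypothetical choices for illustration. The nuisance arrays are assumed to have been fitted already, each on a fold that excludes the trajectory it is evaluated on; that fitting step is not shown.

```python
import numpy as np


def drl_estimate(r, q_hat, v_hat, mu_hat, gamma=1.0):
    """Average a doubly robust score over n trajectories of length T.

    All inputs are (n, T) arrays (hypothetical layout, assumed here):
      r      observed rewards r_t
      q_hat  estimated q-function at the observed (s_t, a_t)
      v_hat  estimated value of s_t under the evaluation policy
      mu_hat estimated marginalized density ratio at (s_t, a_t)
    The nuisances are assumed to be out-of-fold predictions.
    """
    r, q_hat, v_hat, mu_hat = (
        np.asarray(x, dtype=float) for x in (r, q_hat, v_hat, mu_hat)
    )
    n, T = r.shape
    discounts = gamma ** np.arange(T)

    # mu_{t-1} for each step, using the convention mu_{-1} = 1.
    mu_prev = np.concatenate([np.ones((n, 1)), mu_hat[:, :-1]], axis=1)

    # Weighted residual term plus the value-function (direct) term.
    per_step = mu_hat * (r - q_hat) + mu_prev * v_hat

    # Discounted sum over time, then the sample mean over trajectories.
    return float(np.mean(per_step @ discounts))
```

Under this assumed form, setting `q_hat` and `v_hat` to zero leaves a marginalized importance sampling average, while setting `mu_hat` to zero leaves the average of the initial-state value estimates. These two special cases illustrate why an estimate of this shape can remain consistent when only one of the two components is.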

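This paper studies double reinforcement learning (DRL) for off-policy evaluation in Markov decision processes, where states, actions, and rewards are memoryless. DRL is built on cross-fold estimation of $q$-functions and marginalized density ratios. The paper shows that DRL is efficient when both components are estimated at fourth-root rates, and that it is doubly robust when only one component is consistent.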

Double Reinforcement Learning for Efficient Off-Policy Evaluation in Markov Decision Processes