Statistically Efficient Off-Policy Policy Gradients

Feb, 2020

Nathan Kallus, Masatoshi Uehara

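TL;DR: This paper studies how to efficiently estimate policy gradients from off-policy data. We propose a meta-algorithm that achieves the asymptotic lower bound on the feasible mean-squared error without any parametric assumptions and that has a 3-way double robustness property. We also discuss how to estimate the nuisances the algorithm relies on. Finally, we establish guarantees on the rate at which we approach a stationary point when we take steps in the direction of the new estimated policy gradient.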

Abstract

Policy gradient methods in reinforcement learning update policy parameters by taking steps in the direction of an estimated gradient of policy value. In this paper, we consider the statistically efficient estimat