We study the sample complexity of reducing reinforcement learning to a sequence of empirical risk minimization problems over the policy space. Such reductions-based algorithms exhibit local convergence in the function space, as opposed to the parameter space for policy gradient algorithms, and thus are unaffected by the possibly non-linear or discontinuous parameterization of the policy class. We propose a variance-reduced variant of Conservative Policy Iteration that improves the sample complexity of producing a $\varepsilon$-functional local optimum from $O(\varepsilon^{-4})$ to $O(\varepsilon^{-3})$. Under state-coverage and policy-completeness assumptions, the algorithm enjoys $\varepsilon$-global optimality after sampling $O(\varepsilon^{-2})$ times, improving upon the previously established $O(\varepsilon^{-3})$ sample requirement.

本文研究了将强化学习转化为一系列关于策略空间的经验风险最小化问题的样本复杂度问题。本文提出的共产主义政策迭代的方差递减变种可以将从O（ε^-4）到O（ε^-3）的功能局部最优解的样本复杂度改进。在状态覆盖和政策完整性的假设下，该算法在采样O（ε^-2）次后享有ε-全局最优性，这改善了以前已经建立的O（ε^-3）样本要求。

方差降低的保守策略迭代