We study the sample complexity of reducing reinforcement learning to a
sequence of empirical risk minimization problems over the policy space. Such
reduction-based algorithms exhibit local convergence in the function space, as
opposed to the parameter space for policy gradient algorithms, and thus are
unaffected by the possibly non-linear or discontinuous parameterization of the
policy class. We propose a variance-reduced variant of Conservative Policy
Iteration that improves the sample complexity of producing an
$\varepsilon$-functional local optimum from $O(\varepsilon^{-4})$ to
$O(\varepsilon^{-3})$. Under state-coverage and policy-completeness
assumptions, the algorithm enjoys $\varepsilon$-global optimality after
sampling $O(\varepsilon^{-2})$ times, improving upon the previously established
$O(\varepsilon^{-3})$ sample requirement.
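
As a rough illustration of what the local-optimality rates above mean, the following worked arithmetic assumes a target accuracy of $\varepsilon = 0.01$ (a value chosen here for illustration only) and ignores the constants and any lower-order factors hidden by the $O(\cdot)$ notation:

$$
\varepsilon^{-4} = (10^{-2})^{-4} = 10^{8},
\qquad
\varepsilon^{-3} = (10^{-2})^{-3} = 10^{6},
\qquad
\frac{\varepsilon^{-4}}{\varepsilon^{-3}} = \varepsilon^{-1} = 100 .
$$

Under these assumptions, moving from $O(\varepsilon^{-4})$ to $O(\varepsilon^{-3})$ saves a factor of $1/\varepsilon$ in samples. The same factor separates the $O(\varepsilon^{-3})$ and $O(\varepsilon^{-2})$ global-optimality rates.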

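This paper studies the sample complexity of reducing reinforcement learning to a sequence of empirical risk minimization problems over the policy space. The proposed variance-reduced variant of Conservative Policy Iteration improves the sample complexity of finding an $\varepsilon$-functional local optimum from $O(\varepsilon^{-4})$ to $O(\varepsilon^{-3})$. Under state-coverage and policy-completeness assumptions, the algorithm attains $\varepsilon$-global optimality after $O(\varepsilon^{-2})$ samples, improving on the previously established $O(\varepsilon^{-3})$ sample requirement.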

Variance-Reduced Conservative Policy Iteration

Conservative Policy Iteration (CPI) is a founding algorithm of Approximate
Dynamic Programming (ADP). Its core principle is to stabilize greediness
through stochastic mixtures of consecutive policies. It comes with strong
theoretical guarantees and has inspired approaches in deep Reinforcement Learning
(RL). However, CPI itself has rarely been implemented, never with neural
networks, and has only been tested on toy problems. In this paper, we show how CPI
can be practically combined with deep RL with discrete actions. We also
introduce adaptive mixture rates inspired by the theory. We thoroughly evaluate
the resulting algorithm on the simple Cartpole problem, and validate
the proposed method on a representative subset of Atari games. Overall, this
work suggests that revisiting classic ADP may lead to improved and more stable
deep RL algorithms.
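
The mixture principle described above can be sketched for a small tabular problem. This is a minimal illustration of the basic CPI-style update, which mixes the current policy with a greedy policy at rate `alpha`. It is not the deep or adaptive-rate variant from the paper; the function names and the Q-values below are hypothetical and chosen only for the example.

```python
def greedy_policy(q_values):
    """One-hot policy that picks the highest-valued action in each state."""
    policy = []
    for row in q_values:
        best = max(range(len(row)), key=lambda a: row[a])
        policy.append([1.0 if a == best else 0.0 for a in range(len(row))])
    return policy


def cpi_mixture(policy, greedy, alpha):
    """Conservative update: (1 - alpha) * policy + alpha * greedy, per state."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [
        [(1.0 - alpha) * p + alpha * g for p, g in zip(p_row, g_row)]
        for p_row, g_row in zip(policy, greedy)
    ]


# Two states, two actions. The Q-values are made-up numbers (an assumption).
policy = [[0.5, 0.5], [0.5, 0.5]]
q_values = [[1.0, 0.0], [0.2, 0.7]]
new_policy = cpi_mixture(policy, greedy_policy(q_values), alpha=0.2)
# new_policy is approximately [[0.6, 0.4], [0.4, 0.6]]
```

A small `alpha` keeps the new policy close to the old one, which is the sense in which the update is conservative; `alpha = 1` recovers a fully greedy step.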

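This paper studies how to apply the classic Conservative Policy Iteration algorithm in practice to deep reinforcement learning and introduces adaptive mixture rates. Experiments on the Cartpole problem and on Atari games validate the algorithm's effectiveness and stability, suggesting that revisiting classic Approximate Dynamic Programming may lead to improved and more stable deep reinforcement learning algorithms.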

Deep Conservative Policy Iteration

We consider the infinite-horizon discounted optimal control problem
formalized by Markov Decision Processes. We focus on Policy Search algorithms
that compute an approximately optimal policy by following the standard Policy
Iteration (PI) scheme via an $\epsilon$-approximate greedy operator (Kakade and Langford,
2002; Lazaric et al., 2010). We describe existing and a few new performance
bounds for Direct Policy Iteration (DPI) (Lagoudakis and Parr, 2003; Fern et
al., 2006; Lazaric et al., 2010) and Conservative Policy Iteration (CPI)
(Kakade and Langford, 2002). By paying particular attention to the
concentrability constants involved in such guarantees, we notably argue that
the guarantee of CPI is much better than that of DPI, but this comes at the
cost of a relative increase in time complexity that is exponential in
$\frac{1}{\epsilon}$. We then describe an algorithm, Non-Stationary Direct Policy
Iteration (NSDPI), that can either be seen as 1) a variation of Policy Search
by Dynamic Programming by Bagnell et al. (2003) to the infinite horizon
situation or 2) a simplified version of the Non-Stationary PI with growing
period of Scherrer and Lesner (2012). We provide an analysis of this algorithm
which shows in particular that it enjoys the best of both worlds: its
performance guarantee is similar to that of CPI, but within a time complexity
similar to that of DPI.

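This work considers the infinite-horizon discounted optimal control problem for Markov Decision Processes and gives performance guarantees for Policy Search algorithms, as well as for Direct Policy Iteration and Conservative Policy Iteration. It also proposes the Non-Stationary Direct Policy Iteration algorithm and shows that its time complexity is similar to that of DPI, while its performance guarantee is better than that of DPI and comparable to that of CPI.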