Trust region policy optimization (TRPO) is a popular and empirically
successful policy search algorithm in Reinforcement Learning (RL) that
iteratively solves a surrogate problem restricting consecutive policies to be
'close' to one another. Nevertheless, TRPO has been considered a
heuristic algorithm inspired by Conservative Policy Iteration (CPI). We show
that the adaptive scaling mechanism used in TRPO is in fact the natural "RL
version" of traditional trust-region methods from convex analysis. We first
analyze TRPO in the planning setting, in which we have access to the model and
the entire state space. Then, we consider sample-based TRPO and establish an
$\tilde O(1/\sqrt{N})$ convergence rate to the global optimum. Importantly, the
adaptive scaling mechanism allows us to analyze TRPO in regularized MDPs for
which we prove fast rates of $\tilde O(1/N)$, much like results in convex
optimization. This is the first result in RL showing better rates when the
instantaneous cost or reward is regularized.
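
To make the idea of keeping consecutive policies 'close' concrete, the following is a minimal sketch of one standard trust-region-style update in the planning setting: a KL-proximal (exponentiated-gradient) step on a tabular MDP with known model. It is an illustration under stated assumptions, not necessarily the exact update analyzed in the paper. The function names (`evaluate_q`, `kl_proximal_step`, `run`), the two-state toy MDP, and the step size $1/\sqrt{k+1}$ are all hypothetical choices made for this example.

```python
import math


def evaluate_q(P, c, pi, gamma, sweeps=500):
    """Approximate q^pi(s, a) and v^pi(s) for a cost-minimizing tabular MDP
    by repeated application of the Bellman operator for the policy pi."""
    n_s, n_a = len(c), len(c[0])
    v = [0.0] * n_s
    q = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(sweeps):
        q = [[c[s][a] + gamma * sum(P[s][a][s2] * v[s2] for s2 in range(n_s))
              for a in range(n_a)] for s in range(n_s)]
        v = [sum(pi[s][a] * q[s][a] for a in range(n_a)) for s in range(n_s)]
    return q, v


def kl_proximal_step(pi, q, t):
    """Per state, minimize <q(s, .), p> + (1/t) * KL(p || pi(. | s)) over the
    simplex. The closed-form minimizer is p(a) proportional to
    pi(a | s) * exp(-t * q(s, a))."""
    new_pi = []
    for probs, q_s in zip(pi, q):
        m = min(q_s)  # shift by the minimum for numerical stability
        w = [p * math.exp(-t * (qa - m)) for p, qa in zip(probs, q_s)]
        z = sum(w)
        new_pi.append([x / z for x in w])
    return new_pi


def run(P, c, gamma, iterations=200):
    """Start from the uniform policy and alternate evaluation and update.
    The step size 1/sqrt(k+1) is an illustrative assumption."""
    n_s, n_a = len(c), len(c[0])
    pi = [[1.0 / n_a] * n_a for _ in range(n_s)]
    for k in range(iterations):
        q, _ = evaluate_q(P, c, pi, gamma)
        pi = kl_proximal_step(pi, q, t=1.0 / math.sqrt(k + 1))
    _, v = evaluate_q(P, c, pi, gamma)
    return pi, v


# Hypothetical toy MDP: action a moves deterministically to state a.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[1.0, 0.0], [0.0, 1.0]]]
c = [[1.0, 0.5],   # costs in state 0
     [1.0, 0.0]]   # costs in state 1
pi, v = run(P, c, gamma=0.9)
```

In this toy problem the optimal policy takes action 1 in both states, with optimal values 0.5 in state 0 and 0 in state 1. The KL term plays the role of the trust region: a smaller step size `t` keeps the new policy closer to the previous one.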

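This paper examines the relationship between Trust region policy optimization (TRPO), a popular algorithm in reinforcement learning, and the traditional trust-region methods of convex analysis. It shows that TRPO's adaptive scaling mechanism is in fact the RL version of traditional trust-region methods and, for regularized MDPs, establishes fast convergence rates, the first result in RL showing better rates when the instantaneous cost or reward is regularized.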