Trust-region methods based on Kullback-Leibler divergence are pervasively
used to stabilize policy optimization in reinforcement learning. In this paper,
we exploit more flexible metrics and examine two natural extensions of policy
optimization with Wasserstein and Sinkhorn trust regions, namely Wasserstein
policy optimization (WPO) and Sinkhorn policy optimization (SPO). Instead of
restricting the policy to a parametric distribution class, we directly optimize
the policy distribution and derive closed-form policy updates for both methods
via Lagrangian duality. Theoretically, we show that WPO guarantees a monotonic
performance improvement, and SPO provably converges to WPO as the entropic
regularizer diminishes. Moreover, we prove that with a decaying Lagrangian
multiplier for the trust-region constraint, both methods converge to global
optimality. Experiments on tabular domains, robotic locomotion, and continuous
control tasks further show that, compared with state-of-the-art policy gradient
methods, both approaches improve performance, WPO is more robust to sample
insufficiency, and SPO converges faster.

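Summary: This paper discusses the use of KL-divergence-based trust-region methods in reinforcement learning, proposes two new trust-region methods for policy optimization based on the Wasserstein and Sinkhorn metrics, and validates them experimentally on several tasks.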
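
The claim that the Sinkhorn formulation approaches the Wasserstein one as the entropic regularizer diminishes can be illustrated on a toy problem. The sketch below is not WPO or SPO: it only compares the transport cost of an entropy-regularized plan with the exact 1-Wasserstein distance between two fixed discrete distributions. The distributions `p` and `q`, the three-point support, the cost matrix, and the function names are all assumptions made for illustration.

```python
import numpy as np

def wasserstein1_1d(p, q, support):
    """Exact 1-Wasserstein distance between two distributions on a sorted 1-D grid."""
    cdf_gap = np.abs(np.cumsum(p) - np.cumsum(q))[:-1]
    return float(np.sum(cdf_gap * np.diff(support)))

def sinkhorn_cost(p, q, cost, eps, n_iters=10_000):
    """Transport cost <P, C> of the entropy-regularized plan from Sinkhorn iterations."""
    kernel = np.exp(-cost / eps)              # Gibbs kernel for regularizer eps
    u = np.ones_like(p)
    v = np.ones_like(q)
    for _ in range(n_iters):
        u = p / (kernel @ v)                  # rescale to match the row marginal p
        v = q / (kernel.T @ u)                # rescale to match the column marginal q
    plan = u[:, None] * kernel * v[None, :]   # approximate optimal coupling
    return float(np.sum(plan * cost))

# Hypothetical toy data: two distributions on the points 0, 1, 2.
support = np.array([0.0, 1.0, 2.0])
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])
cost = np.abs(support[:, None] - support[None, :])

w1 = wasserstein1_1d(p, q, support)
for eps in (1.0, 0.3, 0.1):
    c = sinkhorn_cost(p, q, cost, eps)
    print(f"eps={eps:4.1f}  sinkhorn cost={c:.4f}  exact W1={w1:.4f}")
```

On this toy input the regularized cost should move toward the exact value as `eps` shrinks. Much smaller values of `eps` would need a log-domain implementation, because the entries of the Gibbs kernel underflow.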

Provably Convergent Policy Optimization via Metric-aware Trust Region Methods

We consider variants of trust-region and cubic regularization methods for
non-convex optimization, in which the Hessian matrix is approximated. Under
mild conditions on the inexact Hessian, and using approximate solutions of the
corresponding sub-problems, we provide iteration complexity bounds for achieving
$\epsilon$-approximate second-order optimality, which have been shown to be tight.
Our Hessian approximation conditions constitute a major relaxation over the
existing ones in the literature. Consequently, we are able to show that such
mild conditions allow for the construction of the approximate Hessian through
various random sampling methods. In this light, we consider the canonical
problem of finite-sum minimization, provide appropriate uniform and non-uniform
sub-sampling strategies to construct such Hessian approximations, and obtain
optimal iteration complexity for the corresponding sub-sampled trust-region and
cubic regularization methods.

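Summary: This paper studies variants of trust-region and cubic regularization methods for non-convex optimization in which the Hessian matrix is approximated. Under conditions on the inexact Hessian, and using approximate solutions of the corresponding sub-problems, it gives the iteration complexity of reaching approximate second-order optimality, with conditions that are relaxed relative to the existing literature.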
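
As a rough illustration of building an approximate Hessian by random sampling in a finite-sum problem, the sketch below averages per-sample Hessians over a uniformly drawn subset and measures the spectral-norm error against the full Hessian. Everything specific here is an assumption for illustration: the non-convex loss f_i(x) = (sigmoid(a_i^T x) - b_i)^2, the synthetic data, the sample sizes, and sampling without replacement. It does not implement the trust-region or cubic regularization steps, nor the non-uniform sampling strategies mentioned in the abstract.

```python
import numpy as np

def curvature_weights(A, b, x):
    """Scalars w_i such that the Hessian of f_i is w_i * a_i a_i^T."""
    s = 1.0 / (1.0 + np.exp(-(A @ x)))   # sigmoid(a_i^T x)
    ds = s * (1.0 - s)                   # first derivative of the sigmoid
    d2s = ds * (1.0 - 2.0 * s)           # second derivative of the sigmoid
    return 2.0 * ds**2 + 2.0 * (s - b) * d2s

def hessian(A, b, x, idx=None):
    """Average Hessian over the samples in idx (all samples when idx is None)."""
    if idx is None:
        idx = np.arange(A.shape[0])
    A_s = A[idx]
    w = curvature_weights(A_s, b[idx], x)
    return (A_s.T * w) @ A_s / len(idx)

# Hypothetical synthetic finite-sum problem.
rng = np.random.default_rng(0)
n, d = 2000, 5
A = rng.normal(size=(n, d))
b = rng.integers(0, 2, size=n).astype(float)
x = rng.normal(size=d)

H_full = hessian(A, b, x)
for size in (50, 200, 800):
    idx = rng.choice(n, size=size, replace=False)   # uniform sub-sample
    err = np.linalg.norm(hessian(A, b, x, idx) - H_full, ord=2)
    print(f"sample size {size:4d}: spectral-norm error {err:.4f}")
```

The error is random, so a single draw need not decrease monotonically with the sample size, although it typically shrinks as more samples are used.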