We present new policy mirror descent (PMD) methods for solving reinforcement
learning (RL) problems with either strongly convex or general convex
regularizers. By exploring the structural properties of these overall highly
nonconvex problems, we show that the PMD methods exhibit a fast linear rate of
convergence to global optimality. We develop stochastic counterparts of
these methods, and establish an ${\cal O}(1/\epsilon)$ (resp., ${\cal
O}(1/\epsilon^2)$) sampling complexity for solving these RL problems with
strongly (resp., general) convex regularizers using different sampling schemes,
where $\epsilon$ denotes the target accuracy. We further show that the
complexity for computing the gradients of these regularizers, if necessary, can
be bounded by ${\cal O}\{(\log_\gamma \epsilon) [(1-\gamma)L/\mu]^{1/2}\log
(1/\epsilon)\}$ (resp., ${\cal O} \{(\log_\gamma \epsilon )
(L/\epsilon)^{1/2}\}$) for problems with strongly (resp., general) convex
regularizers. Here $\gamma$ denotes the discount factor. To the best of our
knowledge, these complexity bounds, along with our algorithmic developments,
appear to be new in both the optimization and RL literature. The introduction of
these convex regularizers also greatly expands the flexibility and
applicability of RL models.
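
The PMD update summarized above can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: a two-state, two-action tabular MDP with made-up costs and transitions, exact policy evaluation by Bellman sweeps, the KL divergence as the Bregman distance, and a negative-entropy regularizer scaled by a hypothetical weight `tau`. Under those assumptions the per-state proximal step has a closed form, which `pmd_step` implements. The Q-values below omit the current-state regularizer term; it is constant across actions and so does not change the update.

```python
import math

def evaluate(policy, cost, P, gamma, tau, sweeps=500):
    """Approximate the regularized value function by Bellman fixed-point sweeps."""
    n_s, n_a = len(cost), len(cost[0])
    V = [0.0] * n_s
    for _ in range(sweeps):
        V = [
            sum(
                policy[s][a] * (
                    cost[s][a]
                    + tau * math.log(policy[s][a])  # negative-entropy regularizer
                    + gamma * sum(P[s][a][t] * V[t] for t in range(n_s))
                )
                for a in range(n_a)
            )
            for s in range(n_s)
        ]
    return V

def q_values(V, cost, P, gamma):
    """State-action costs-to-go (regularizer of the current state omitted)."""
    n_s, n_a = len(cost), len(cost[0])
    return [
        [cost[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(n_s))
         for a in range(n_a)]
        for s in range(n_s)
    ]

def pmd_step(policy, Q, eta, tau):
    """Closed-form minimizer of eta*(<Q, p> + tau*sum p log p) + KL(p, policy)."""
    scale = 1.0 + eta * tau
    new_policy = []
    for row, q_row in zip(policy, Q):
        logits = [(math.log(p) - eta * q) / scale for p, q in zip(row, q_row)]
        top = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(l - top) for l in logits]
        total = sum(weights)
        new_policy.append([w / total for w in weights])
    return new_policy

def run_pmd(cost, P, gamma=0.9, tau=0.5, eta=1.0, iterations=30):
    """Run PMD from the uniform policy; track the sum of state values."""
    n_s, n_a = len(cost), len(cost[0])
    policy = [[1.0 / n_a] * n_a for _ in range(n_s)]
    history = []
    for _ in range(iterations):
        V = evaluate(policy, cost, P, gamma, tau)
        history.append(sum(V))
        policy = pmd_step(policy, q_values(V, cost, P, gamma), eta, tau)
    return policy, history

# Hypothetical MDP: COST[s][a] and P[s][a][next_state] are made-up numbers.
COST = [[1.0, 0.0], [0.5, 2.0]]
P = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.1, 0.9]]]
policy, history = run_pmd(COST, P)
```

In this sketch `history` records the total regularized cost of each iterate, so plotting it against the iteration index gives a rough view of how quickly the policies improve on this toy problem.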

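This paper proposes new policy mirror descent (PMD) methods for solving reinforcement learning (RL) problems with strongly convex or general convex regularizers, and develops stochastic counterparts of these methods using different sampling schemes. We prove that the PMD methods converge to the global optimum at a fast linear rate, bound the complexity of computing the gradients of these regularizers, and demonstrate the applicability of this regularization.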

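Policy Mirror Descent Algorithms for Reinforcement Learning: Linear Convergence, New Sampling Complexity, and Generalized Problem Classes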

Policy Mirror Descent for Reinforcement Learning: Linear Convergence, New Sampling Complexity, and Generalized Problem Classes

We consider the minimization of composite objective functions composed of the
expectation of quadratic functions and an arbitrary convex function. We study
the stochastic dual averaging algorithm with a constant step-size, showing that
it leads to a convergence rate of O(1/n) without strong convexity assumptions.
This extends earlier results on least-squares regression with the
Euclidean geometry to (a) all convex regularizers and constraints, and (b) all
geometries represented by a Bregman divergence. This is achieved by a new
proof technique that relates stochastic and deterministic recursions.
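
As a rough illustration of constant-step stochastic dual averaging on a composite least-squares problem, the sketch below assumes the Euclidean geometry and an l1 regularizer, for which the primal iterate is a soft-thresholding of the accumulated gradient sum. The synthetic data, the step size `step`, the weight `lam`, and all function names are hypothetical choices made for this example and are not taken from the paper.

```python
import random

def soft_threshold(value, threshold):
    """Proximal map of threshold * |.| applied to a scalar."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def dual_averaging_l1(samples, dim, step, lam):
    """Constant-step dual averaging for 0.5*E[(x.theta - y)^2] + lam*||theta||_1.

    Returns the average of the primal iterates.
    """
    dual = [0.0] * dim    # minus step times the running sum of gradients
    theta = [0.0] * dim   # current primal iterate
    average = [0.0] * dim
    for n, (x, y) in enumerate(samples, start=1):
        residual = sum(xi * ti for xi, ti in zip(x, theta)) - y
        dual = [d - step * residual * xi for d, xi in zip(dual, x)]
        # The regularizer accumulates over n steps, so the threshold grows with n.
        theta = [soft_threshold(d, n * step * lam) for d in dual]
        average = [a + (t - a) / n for a, t in zip(average, theta)]
    return average

def make_samples(n, true_theta, noise, seed=0):
    """Synthetic Gaussian design with additive Gaussian noise (made-up data)."""
    rng = random.Random(seed)
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in true_theta]
        y = sum(xi * ti for xi, ti in zip(x, true_theta)) + rng.gauss(0.0, noise)
        yield x, y

TRUE_THETA = [1.0, -2.0, 0.0]
estimate = dual_averaging_l1(
    make_samples(20000, TRUE_THETA, noise=0.1), dim=3, step=0.05, lam=0.01
)
```

Other regularizers or Bregman geometries would replace only the line that maps `dual` to `theta`; the gradient accumulation and the averaging stay the same in this sketch.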

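We consider the minimization of composite objective functions formed by the expectation of quadratic functions and an arbitrary convex function. We study the stochastic dual averaging algorithm with a constant step size and prove that it attains an O(1/n) convergence rate without strong convexity assumptions, thereby extending earlier results on least-squares regression in Euclidean geometry to (a) all convex regularizers and constraints, and (b) all geometries represented by a Bregman divergence. This is achieved through a new proof technique that relates stochastic and deterministic recursions.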

Stochastic Composite Least-Squares Regression with Convergence Rate O(1/n)

Stochastic Composite Least-Squares Regression with convergence rate O(1/n)

The use of convex regularizers allows for easy optimization, though they
often produce biased estimation and inferior prediction performance. Recently,
nonconvex regularizers have attracted a lot of attention and outperformed
convex ones. However, the resultant optimization problem is much harder. In
this paper, for a large class of nonconvex regularizers, we propose to move the
nonconvexity from the regularizer to the loss. The nonconvex regularizer is
then transformed to a familiar convex regularizer, while the resultant loss
function can still be guaranteed to be smooth. Learning with the convexified
regularizer can be performed by existing efficient algorithms originally
designed for convex regularizers (such as the proximal algorithm, Frank-Wolfe
algorithm, alternating direction method of multipliers and stochastic gradient
descent). Extensions are made when the convexified regularizer does not have a
closed-form proximal step, and when the loss function is nonconvex or nonsmooth.
Extensive experiments on a variety of machine learning application scenarios
show that optimizing the transformed problem is much faster than running the
state-of-the-art on the original problem.
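
The idea of moving nonconvexity from the regularizer to the loss can be sketched on a small example. The code below assumes, purely for illustration, a log-sum penalty `BETA * log(1 + |x| / THETA)` as the nonconvex regularizer and a simple denoising loss `0.5 * ||x - b||^2`; the parameter values and function names are hypothetical. The penalty is split into an l1 term with weight `KAPPA0` (its slope at zero) plus a smooth concave remainder, and the remainder is added to the loss so that an ordinary proximal gradient method with soft-thresholding applies.

```python
import math

BETA, THETA = 0.5, 1.0   # illustrative log-sum penalty parameters
KAPPA0 = BETA / THETA    # slope of the penalty at zero = weight of the l1 term

def log_sum_penalty(x):
    """Nonconvex regularizer: sum of BETA * log(1 + |x_i| / THETA)."""
    return sum(BETA * math.log(1.0 + abs(v) / THETA) for v in x)

def concave_part_grad(x):
    """Gradient of (log_sum_penalty - KAPPA0 * l1 norm); it vanishes at zero."""
    return [math.copysign(1.0, v) * (BETA / (THETA + abs(v)) - KAPPA0) for v in x]

def soft_threshold(value, threshold):
    """Proximal map of threshold * |.| applied to a scalar."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def objective(x, b):
    """Original objective: squared loss plus the nonconvex penalty."""
    return 0.5 * sum((xi - bi) ** 2 for xi, bi in zip(x, b)) + log_sum_penalty(x)

def prox_grad(b, iterations=200):
    """Proximal gradient on (loss + concave remainder) with an l1 regularizer."""
    # The augmented loss is smooth with constant at most 1 + BETA / THETA**2.
    step = 1.0 / (1.0 + BETA / THETA ** 2)
    x = [0.0] * len(b)
    history = [objective(x, b)]
    for _ in range(iterations):
        extra = concave_part_grad(x)
        x = [
            soft_threshold(xi - step * ((xi - bi) + gi), step * KAPPA0)
            for xi, bi, gi in zip(x, b, extra)
        ]
        history.append(objective(x, b))
    return x, history

solution, history = prox_grad([3.0, -0.2, 0.0, 1.5])
```

Because the remaining regularizer is the plain l1 norm, the proximal step stays a closed-form soft-thresholding, which is the practical benefit the transformation is meant to deliver.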

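This paper proposes moving the nonconvexity of a nonconvex regularizer into the loss function, so that the regularizer becomes a familiar convex regularizer while the loss function remains guaranteed to be smooth; the problem can then be solved with existing efficient algorithms designed for convex regularizers. Experiments show that the method significantly speeds up optimization across a variety of machine learning application scenarios.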