We consider a regularized expected reward optimization problem in the non-oblivious setting that covers many existing problems in reinforcement learning (RL). To solve this problem, we apply and analyze the classical stochastic proximal gradient method. In particular, the method is shown to admit an $O(\epsilon^{-4})$ sample complexity for reaching an $\epsilon$-stationary point under standard conditions. Since the variance of the classical stochastic gradient estimator is typically large, which slows convergence, we also apply an efficient stochastic variance-reduced proximal gradient method with an importance-sampling-based ProbAbilistic Gradient Estimator (PAGE). To the best of our knowledge, the application of this method represents a novel approach to the general regularized reward optimization problem. Our analysis shows that the sample complexity can be improved from $O(\epsilon^{-4})$ to $O(\epsilon^{-3})$ under additional conditions. Our results on the stochastic (variance-reduced) proximal gradient method match the sample complexity of their most competitive counterparts under similar settings in the RL literature.
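To make the basic iteration concrete, the following is a minimal sketch of a stochastic proximal gradient ascent step on a toy problem. Everything specific here is an assumption for illustration and is not taken from the paper: the regularizer is $\lambda\|\theta\|_1$ (whose proximal operator is soft-thresholding), the sampling distribution is a Gaussian $N(\theta, I)$, the reward is $-\|x - c\|^2$ for a hypothetical target $c$, and the gradient is estimated with a plain score-function estimator. The names `soft_threshold`, `prox_grad_step`, `score_function_gradient`, and `run` are hypothetical helpers. The PAGE variance-reduced estimator is not implemented.

```python
import numpy as np


def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (illustrative choice of regularizer)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)


def prox_grad_step(theta, grad_est, step, lam):
    """One proximal gradient ascent step for max J(theta) - lam * ||theta||_1."""
    return soft_threshold(theta + step * grad_est, step * lam)


def score_function_gradient(theta, target, batch, rng):
    """Monte Carlo estimate of grad J for J(theta) = E_{x ~ N(theta, I)}[-||x - target||^2].

    The sampling distribution depends on theta (the non-oblivious setting),
    so the estimate weights each reward by the score (x - theta) of N(theta, I).
    """
    x = rng.normal(loc=theta, scale=1.0, size=(batch, theta.size))
    reward = -np.sum((x - target) ** 2, axis=1)
    return np.mean(reward[:, None] * (x - theta), axis=0)


def run(target, lam=0.5, step=0.05, batch=256, iters=300, seed=0):
    """Iterate the stochastic proximal gradient step from theta = 0 (toy settings)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros_like(target, dtype=float)
    for _ in range(iters):
        g = score_function_gradient(theta, target, batch, rng)
        theta = prox_grad_step(theta, g, step, lam)
    return theta
```

Calling `run(np.array([2.0, 0.0]))` returns a noisy iterate. The plain score-function estimator used here has no baseline and therefore a large variance, which is the kind of behavior that motivates a variance-reduced estimator.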

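For the regularized expected reward optimization problem, we apply and analyze the classical stochastic proximal gradient method and show that, under standard conditions, it reaches an $\epsilon$-stationary point with sample complexity $O(\epsilon^{-4})$. Because the variance of the classical stochastic gradient estimator is typically large and slows convergence, we also apply an efficient stochastic variance-reduced proximal gradient method with an importance-sampling-based ProbAbilistic Gradient Estimator (PAGE). Our analysis shows that, under additional conditions, the sample complexity can be improved from $O(\epsilon^{-4})$ to $O(\epsilon^{-3})$. Under similar settings in the reinforcement learning literature, our results for the stochastic (variance-reduced) proximal gradient method match the sample complexity of the most competitive counterparts.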

On the Stochastic (Variance-Reduced) Proximal Gradient Method for Regularized Expected Reward Optimization