We study sequential decision-making problems aimed at maximizing the expected
total reward while satisfying a constraint on the expected total utility. We
employ the natural policy gradient method to solve the discounted
infinite-horizon optimal control problem for Constrained Markov Decision
Processes (constrained MDPs). Specifically, we propose a new Natural Policy
Gradient Primal-Dual (NPG-PD) method that updates the primal variable via
natural policy gradient ascent and the dual variable via projected sub-gradient
descent. Although the underlying maximization involves a nonconcave objective
function and a nonconvex constraint set, under the softmax policy
parametrization we prove that our method achieves global convergence with
sublinear rates regarding both the optimality gap and the constraint violation.
Such convergence is independent of the size of the state-action space, i.e., it
is dimension-free. Furthermore, for log-linear and general smooth policy
parametrizations, we establish sublinear convergence rates up to a function
approximation error caused by restricted policy parametrization. We also
provide convergence and finite-sample complexity guarantees for two
sample-based NPG-PD algorithms. Finally, we use computational experiments to
showcase the merits and the effectiveness of our approach.

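We study how to maximize the expected total reward subject to a constraint on
the expected total utility. We propose a new natural policy gradient primal-dual
method for the discounted infinite-horizon optimal control problem of
constrained Markov decision processes (constrained MDPs), which updates the
primal variable via natural policy gradient ascent and the dual variable via
projected sub-gradient descent.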
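
To make the primal-dual update concrete, the following is a minimal tabular sketch under the softmax parametrization, not the authors' implementation. It assumes exact policy evaluation, a constraint of the form V_g(rho) >= b, and the well-known fact that for softmax policies a natural policy gradient step adds the scaled advantage to the logits. All names (`evaluate`, `npg_pd`, `eta1`, `eta2`, `lam_max`) and the default step sizes are hypothetical choices made for illustration.

```python
import numpy as np

def evaluate(P, c, pi, gamma):
    """Exact Q and V of a per-step signal c (reward or utility) under policy pi.

    P: (S, A, S) transition probabilities, c: (S, A), pi: (S, A).
    """
    S, _ = c.shape
    P_pi = np.einsum("sa,sat->st", pi, P)   # state-to-state kernel under pi
    c_pi = (pi * c).sum(axis=1)             # expected one-step signal per state
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, c_pi)
    Q = c + gamma * P @ V                   # (S, A, S) @ (S,) -> (S, A)
    return Q, V

def softmax(theta):
    z = theta - theta.max(axis=1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def npg_pd(P, r, g, rho, b, gamma=0.9, eta1=1.0, eta2=0.1,
           lam_max=10.0, iters=200):
    """Sketch of a natural policy gradient primal-dual loop.

    Assumed problem: maximize V_r(rho) subject to V_g(rho) >= b.
    lam_max is a hypothetical upper bound for the dual projection interval.
    """
    S, A = r.shape
    theta = np.zeros((S, A))   # softmax logits (primal variable)
    lam = 0.0                  # Lagrange multiplier (dual variable)
    for _ in range(iters):
        pi = softmax(theta)
        Qr, Vr = evaluate(P, r, pi, gamma)
        Qg, Vg = evaluate(P, g, pi, gamma)
        # Advantage of the Lagrangian signal r + lam * g under the current policy.
        adv = (Qr - Vr[:, None]) + lam * (Qg - Vg[:, None])
        # Primal step: natural policy gradient ascent on the logits.
        theta = theta + eta1 / (1.0 - gamma) * adv
        # Dual step: projected sub-gradient descent onto [0, lam_max].
        lam = float(np.clip(lam - eta2 * (rho @ Vg - b), 0.0, lam_max))
    return softmax(theta), lam

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, A = 4, 3
    P = rng.random((S, A, S))
    P /= P.sum(axis=2, keepdims=True)       # normalize into valid transitions
    r, g = rng.random((S, A)), rng.random((S, A))
    rho = np.full(S, 1.0 / S)               # uniform initial state distribution
    pi, lam = npg_pd(P, r, g, rho, b=4.0)
    _, Vg = evaluate(P, g, pi, 0.9)
    print("final multiplier:", lam, "utility value:", rho @ Vg)
```

Both steps use quantities computed from the same iterate, so the multiplier only grows while the current policy violates the assumed constraint and shrinks toward zero otherwise. The sample-based variants mentioned in the abstract would replace `evaluate` with estimates from trajectories, which this sketch does not attempt.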