Actor critic methods have been applied to a wide range of reinforcement
learning tasks, especially when the state-action space is large.
In this paper, we consider actor critic and natural actor critic algorithms
with function approximation for constrained Markov decision processes (C-MDP)
involving inequality constraints and carry out a non-asymptotic analysis of
both algorithms in a non-i.i.d. (Markovian) setting. We consider the
long-run average cost criterion where both the objective and the constraint
functions are suitable policy-dependent long-run averages of certain prescribed
cost functions. We handle the inequality constraints using the Lagrange
multiplier method. We prove that these algorithms are guaranteed to find a
first-order stationary point (i.e., $\Vert \nabla L(\theta,\gamma)\Vert_2^2
\leq \epsilon$) of the performance (Lagrange) function $L(\theta,\gamma)$, with
a sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-2.5})$ for both the
Constrained Actor Critic (C-AC) and Constrained Natural Actor Critic (C-NAC)
algorithms. We also report experiments on a few different grid world settings
and observe good empirical performance with both algorithms. In particular,
for large grid sizes, C-NAC shows slightly better results than C-AC, while the
latter is slightly better for a small grid size.
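
For orientation, one common way to write the constrained problem and its Lagrangian is sketched below. The symbols $J$ (long-run average objective cost), $G_k$ (long-run average constraint costs), $\alpha_k$ (constraint thresholds) and $N$ (number of constraints) are assumed notation for illustration and are not taken from the abstract.

```latex
\min_{\theta} \; J(\theta)
\quad \text{subject to} \quad
G_k(\theta) \le \alpha_k, \quad k = 1, \dots, N,
\qquad
L(\theta, \gamma) = J(\theta) + \sum_{k=1}^{N} \gamma_k \bigl( G_k(\theta) - \alpha_k \bigr),
\quad \gamma_k \ge 0 .
```

Under this reading, $\theta$ is updated to decrease $L$ while each multiplier $\gamma_k$ is increased whenever its constraint is violated.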

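By applying the Lagrange multiplier method, we carry out a non-asymptotic analysis of actor critic and natural actor critic algorithms for C-MDPs with inequality constraints, and prove that in a non-i.i.d. (Markovian) setting these algorithms find a first-order stationary point of the performance function with a sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-2.5})$ for both C-AC and C-NAC. We also run experiments on several grid world settings and observe good empirical results with both algorithms: for large grid sizes, Constrained Natural Actor Critic is slightly better than Constrained Actor Critic, while for a small grid size the latter is slightly better.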

Finite Time Analysis of Constrained Actor Critic and Constrained Natural Actor Critic Algorithms

In safe MDP planning, a cost function based on the current state and action
is often used to specify safety aspects. In the real world, however, the state
representation often lacks sufficient fidelity to specify such safety
constraints. Operating based on an incomplete model can often produce
unintended negative side effects (NSEs). To address these challenges, first, we
associate safety signals with state-action trajectories (rather than just an
immediate state-action). This makes our safety model highly general. We also
assume categorical safety labels are given for different trajectories, rather
than a numerical cost function, which is harder for the problem designer to
specify. We then employ a supervised learning model to learn such
non-Markovian safety patterns. Second, we develop a Lagrange multiplier method,
which incorporates the safety model and the underlying MDP model in a single
computation graph to facilitate agent learning of safe behaviors. Finally, our
empirical results on a variety of discrete and continuous domains show that
this approach can satisfy complex non-Markovian safety constraints while
optimizing an agent's total returns, is highly scalable, and is also better
than the previous best approach for Markovian NSEs.
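
The Lagrange multiplier idea mentioned above can be illustrated with a minimal primal-dual sketch. This is not the paper's method: the quadratic objective and the single linear constraint are stand-ins for the return and the learned safety model, and the function name `primal_dual` and the step sizes are hypothetical choices made for this example.

```python
def primal_dual(steps=2000, lr_x=0.05, lr_lam=0.05):
    """Gradient descent on x, projected gradient ascent on the multiplier lam.

    Toy problem (an assumption for illustration only):
        minimize f(x) = (x - 2)^2   subject to   g(x) = x - 1 <= 0
    Lagrangian: L(x, lam) = f(x) + lam * g(x), with lam >= 0.
    """
    x, lam = 0.0, 0.0
    for _ in range(steps):
        grad_x = 2.0 * (x - 2.0) + lam   # dL/dx
        grad_lam = x - 1.0               # dL/dlam, i.e. the constraint value g(x)
        x_new = x - lr_x * grad_x        # primal step: decrease L in x
        # Dual step: increase lam when the constraint is violated,
        # then project back onto lam >= 0.
        lam = max(0.0, lam + lr_lam * grad_lam)
        x = x_new
    return x, lam


if __name__ == "__main__":
    x, lam = primal_dual()
    print(f"x = {x:.4f}, lam = {lam:.4f}")
```

In this toy problem the unconstrained minimizer x = 2 is infeasible, so the iterates settle on the constraint boundary x = 1, with the multiplier approaching 2, the value at which the objective gradient and the constraint gradient balance.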

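This paper presents an approach to safe MDP planning that associates safety signals with state-action trajectories, uses a supervised learning model to learn non-Markovian safety patterns, and applies a Lagrange multiplier method with a computation graph so that the agent learns safe behaviors. Experimental results show that the approach can satisfy non-Markovian safety constraints and is better than the previous best approach for Markovian NSEs.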