We consider the reinforcement learning problem for the constrained Markov
decision process (CMDP), which plays a central role in satisfying safety or
resource constraints in sequential learning and decision-making. In this
problem, we are given finite resources and an MDP with unknown transition
probabilities. At each stage, we take an action, collecting a reward and
consuming some resources, both of which are unknown and must be learned over
time. In this work, we take the first step towards deriving optimal
problem-dependent guarantees for CMDP problems. We derive a logarithmic
regret bound, which translates into an
$O(\frac{\kappa}{\epsilon}\cdot\log^2(1/\epsilon))$ sample complexity bound,
where $\kappa$ is a problem-dependent parameter that is independent of
$\epsilon$. Our sample complexity bound improves upon the state-of-the-art
$O(1/\epsilon^2)$ sample complexity for CMDP problems established in the
previous literature in terms of the dependency on $\epsilon$. To achieve this
advance, we develop a new framework for analyzing CMDP problems.
Specifically, our algorithm operates in the primal space: we re-solve the
primal LP for the CMDP problem at each period in an online manner, with
\textit{adaptive} remaining resource capacities. The key elements of our
algorithm are: (i) an elimination procedure that characterizes one optimal
basis of the primal LP; and (ii) a re-solving procedure that adapts to the
remaining resources and sticks to the characterized optimal basis.
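
To give a rough feel for the gap between the two rates quoted above, the following sketch evaluates both expressions with all constants dropped. The function names and the default value `kappa=1.0` are assumptions made for illustration only; the abstract does not give a value for $\kappa$, and a large $\kappa$ would shift the crossover point.

```python
import math


def problem_dependent_bound(eps, kappa=1.0):
    """(kappa / eps) * log^2(1 / eps), with constants dropped (illustrative)."""
    return (kappa / eps) * math.log(1.0 / eps) ** 2


def worst_case_bound(eps):
    """1 / eps^2, with constants dropped (illustrative)."""
    return 1.0 / eps ** 2


def ratio(eps, kappa=1.0):
    """Problem-dependent rate divided by the 1/eps^2 rate.

    Algebraically this equals kappa * eps * log^2(1 / eps).
    """
    return problem_dependent_bound(eps, kappa) / worst_case_bound(eps)


if __name__ == "__main__":
    # kappa = 1.0 is a placeholder, not a value taken from the paper.
    for eps in (1e-1, 1e-2, 1e-3, 1e-4):
        print(f"eps={eps:g}  ratio={ratio(eps):.4g}")
```

Under these assumptions the ratio shrinks as $\epsilon$ decreases, which is the sense in which the dependency on $\epsilon$ improves; the comparison says nothing about constants or about how $\kappa$ behaves on a given instance.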

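We study the reinforcement learning problem for constrained Markov decision processes (CMDPs) and, through an improved algorithm, tighten the sample complexity for CMDP problems, taking a step towards optimal problem-dependent guarantees.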

Achieving $\tilde{O}(1/\epsilon)$ Sample Complexity for Constrained Markov Decision Process

Statistical performance bounds for reinforcement learning (RL) algorithms can
be critical for high-stakes applications like healthcare. This paper introduces
a new framework for theoretically measuring the performance of such algorithms
called Uniform-PAC, which is a strengthening of the classical Probably
Approximately Correct (PAC) framework. In contrast to the PAC framework, the
uniform version may be used to derive high probability regret guarantees and so
forms a bridge between the two setups that has been missing in the literature.
We demonstrate the benefits of the new framework for finite-state episodic MDPs
with a new algorithm that is Uniform-PAC and simultaneously achieves optimal
regret and PAC guarantees except for a factor of the horizon.
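
One way to see why a guarantee that holds uniformly over all accuracy levels can bridge PAC and regret is the following sketch. It assumes the usual reading in which a PAC-style statement counts episodes whose optimality gap exceeds $\epsilon$, while regret sums the gaps; the abstract itself does not spell out this definition. The list `gaps` and all function names are hypothetical.

```python
def count_mistakes(gaps, eps):
    """Number of episodes whose optimality gap exceeds eps."""
    return sum(1 for g in gaps if g > eps)


def regret(gaps):
    """Cumulative regret: the sum of per-episode optimality gaps."""
    return sum(gaps)


def regret_from_counts(gaps):
    """Recover the regret by integrating the mistake count over eps.

    For non-negative gaps, sum(gaps) equals the integral over eps >= 0 of
    count_mistakes(gaps, eps). The count is piecewise constant between
    consecutive distinct gap values, so the integral is a finite sum.
    """
    levels = sorted(set(gaps) | {0.0})
    total = 0.0
    for lo, hi in zip(levels, levels[1:]):
        total += (hi - lo) * count_mistakes(gaps, lo)
    return total


if __name__ == "__main__":
    gaps = [0.5, 0.2, 0.2, 0.0]  # hypothetical per-episode gaps
    print(count_mistakes(gaps, 0.1), regret(gaps), regret_from_counts(gaps))
```

Because the regret is an integral of the mistake counts over every $\epsilon$, a bound on the count for a single fixed $\epsilon$ is not enough to control it, whereas a bound that holds for all $\epsilon$ simultaneously can be integrated. This is only an intuition for the bridge described above, not the paper's proof.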

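This paper proposes a new theoretical framework, Uniform-PAC, for measuring the performance of reinforcement learning algorithms, which can provide statistical performance guarantees for high-stakes applications such as healthcare. Unlike the classical PAC framework, it can be used to derive high-probability regret guarantees, and so it forms a bridge between the two settings that was missing in the literature. For finite-state episodic MDPs, we demonstrate the benefits of the framework with a new algorithm that is Uniform-PAC and simultaneously achieves optimal regret and PAC guarantees, except for a factor of the horizon.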