We present a novel algorithm that efficiently computes near-optimal deterministic policies for constrained reinforcement learning (CRL) problems. Our approach combines three key ideas: (1) value-demand augmentation, (2) action-space approximate dynamic programming, and (3) time-space rounding. Under mild reward assumptions, our algorithm constitutes a fully polynomial-time approximation scheme (FPTAS) for a diverse class of cost criteria. This class consists of the criteria for which the cost of a policy can be computed recursively over both time and (state) space; it includes classical expectation, almost-sure, and anytime constraints. Our work not only provides provably efficient algorithms to address real-world challenges in decision-making but also offers a unifying theory for the efficient computation of constrained deterministic policies.
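To make the recursion requirement concrete, the following is a minimal sketch of how the expected cost of a fixed deterministic policy can be computed recursively over time and state by backward induction in a finite-horizon tabular model. It is not the algorithm presented in this work; it only illustrates the expectation criterion. The names `expected_cost`, `P`, `c`, `policy`, `budget`, and the two-state toy model are illustrative assumptions.

```python
def expected_cost(P, c, policy, horizon, start):
    """Expected cumulative cost of a deterministic, time-dependent policy.

    P[s][a][s2] : transition probability (hypothetical tabular model)
    c[s][a]     : immediate cost of taking action a in state s
    policy[h][s]: action chosen at time step h in state s
    """
    n_states = len(P)
    # cost_to_go[s] is the expected remaining cost from state s;
    # it is zero once the horizon has been reached.
    cost_to_go = [0.0] * n_states
    # Recurse backward over time, and over all states at each time step.
    for h in reversed(range(horizon)):
        updated = []
        for s in range(n_states):
            a = policy[h][s]
            future = sum(P[s][a][s2] * cost_to_go[s2] for s2 in range(n_states))
            updated.append(c[s][a] + future)
        cost_to_go = updated
    return cost_to_go[start]


# Toy model (assumed for illustration): state 1 is absorbing.
P = [
    [[1.0, 0.0], [0.5, 0.5]],  # transitions from state 0 under actions 0 and 1
    [[0.0, 1.0], [0.0, 1.0]],  # transitions from state 1 under actions 0 and 1
]
c = [[0.0, 1.0], [2.0, 2.0]]
policy = [[1, 0], [1, 0]]  # action 1 in state 0, action 0 in state 1, at both steps

cost = expected_cost(P, c, policy, horizon=2, start=0)
budget = 3.0  # hypothetical expectation-constraint budget
feasible = cost <= budget
```

Under these assumptions, an expectation constraint is checked by comparing the returned value against the budget. Almost-sure and anytime constraints would need a different recursion and are not covered by this sketch.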

Deterministic Policies for Constrained Reinforcement Learning in Polynomial Time