Sample-Efficient Constrained Reinforcement Learning with General Parameterization

May, 2024

Washim Uddin Mondal, Vaneet Aggarwal

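TL;DR: For constrained Markov decision problems (CMDPs), we develop the Primal-Dual Accelerated Natural Policy Gradient (PD-ANPG) algorithm, which guarantees an ε global optimality gap and an ε constraint violation with a sample complexity of O(ε^-3), improving the sample complexity for CMDPs by a factor of O(ε^-1).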

Abstract

We consider a constrained Markov decision problem (CMDP) where the goal of an agent is to maximize the expected discounted sum of rewards over an infinite horizon while ensuring that the expected discounted sum of costs exceeds a certain threshold. Building on the idea of momentum-base