Variance-Reduced Policy Gradient Approaches for Infinite Horizon Average Reward Markov Decision Processes

Apr, 2024

Variance-Reduced Policy Gradient Approaches for Infinite Horizon Average Reward Markov Decision Processes

Swetha Ganesh, Washim Uddin Mondal, Vaneet Aggarwal

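TL;DR: Two policy gradient-based methods with general parameterization are presented for infinite horizon average reward Markov decision processes. The first employs Implicit Gradient Transport for variance reduction and ensures an expected regret of order $\tilde{\mathcal{O}}(T^{3/5})$. The second is rooted in Hessian-based techniques and ensures an expected regret of order $\tilde{\mathcal{O}}(\sqrt{T})$. Both results significantly improve on the previous state of the art for this problem, which achieves a regret of order $\tilde{\mathcal{O}}(T^{3/4})$.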
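To make the comparison of the three rates concrete, one can divide each regret bound by the horizon $T$, which gives the average per-step regret. The following is only a back-of-the-envelope reading of the stated bounds: it ignores the logarithmic factors and constants hidden by $\tilde{\mathcal{O}}$, and it uses no information beyond the exponents quoted above.

$$
\frac{T^{3/4}}{T} = T^{-1/4}, \qquad
\frac{T^{3/5}}{T} = T^{-2/5}, \qquad
\frac{\sqrt{T}}{T} = T^{-1/2}.
$$

All three rates are sublinear, so the per-step regret vanishes as $T \to \infty$, but it vanishes faster for the two new methods. For instance, at $T = 10^{4}$ the three expressions evaluate to $10^{-1}$, roughly $2.5 \times 10^{-2}$, and $10^{-2}$ respectively.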

Abstract

We present two policy gradient-based methods with general parameterization in the context of infinite horizon average reward Markov decision processes. The first approach employs Implicit Gradient Transport for …