Regret-Optimal Model-Free Reinforcement Learning for Discounted MDPs with Short Burn-In Time

May, 2023

Regret-Optimal Model-Free Reinforcement Learning for Discounted MDPs with Short Burn-In Time

Xiang Ji, Gen Li

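TL;DR: This paper proposes a model-free algorithm that, through variance reduction and a novel execution policy, addresses two shortcomings of reinforcement learning in Markov decision processes: failure to achieve regret optimality and a long burn-in period. It achieves optimal sample efficiency with a short burn-in time.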

Abstract

A crucial problem in reinforcement learning is learning the optimal policy. We study this problem in tabular infinite-horizon discounted Markov decision processes under the online setting. The existing algorithms either