Horizon-Free Reinforcement Learning in Polynomial Time: the Power of Stationary Policies

Mar, 2022

Zihan Zhang, Xiangyang Ji, Simon S. Du

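TL;DR: This paper presents the first polynomial-time algorithm for tabular MDPs whose regret bound is independent of the planning horizon. The result rests on a series of new structural lemmas that establish the approximation power, stability, and concentration properties of stationary policies.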

Abstract

This paper gives the first polynomial-time algorithm for tabular Markov Decision Processes (MDP) that enjoys a regret bound \emph{independent of the planning horizon}. Specifically, we consider a tabular MDP with $