BriefGPT.xyz
Mar, 2022
Horizon-Free Reinforcement Learning in Polynomial Time: the Power of Stationary Policies
Zihan Zhang, Xiangyang Ji, Simon S. Du
TL;DR
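This paper presents the first polynomial-time algorithm for finite MDPs whose regret bound is independent of the planning horizon. Through a series of new structural lemmas, it establishes stability and concentration properties and strengthens the approximation power available for MDPs.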
Abstract
This paper gives the first polynomial-time algorithm for tabular Markov Decision Processes (MDPs) that enjoys a regret bound independent of the planning horizon. Specifically, we consider tabular MDP with $…