Jan, 2019
Q-learning with UCB Exploration is Sample Efficient for Infinite-Horizon MDP
Kefan Dong, Yuanhao Wang, Xiaoyu Chen, Liwei Wang
TL;DR
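This paper adapts Q-learning with a UCB exploration bonus to infinite-horizon Markov decision processes (MDPs). The analysis shows that the algorithm's sample complexity of exploration is bounded by Õ(SA/(ε²(1−γ)⁷)), which improves on the previous best result, achieved by delayed Q-learning.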
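To make the idea concrete, here is a minimal sketch of tabular Q-learning with a count-based UCB bonus on a discounted MDP. It is an illustration only, not the authors' exact algorithm: the function name `ucb_q_learning`, the effective horizon `H = 1/(1 - gamma)`, the form of the bonus, and the constants `c` and `delta` are all assumptions chosen for readability.

```python
import math
import random


def ucb_q_learning(P, R, gamma, steps, c=1.0, delta=0.1, seed=0):
    """Tabular Q-learning with a UCB-style bonus (illustrative sketch).

    P[s][a] is a list of next-state probabilities.
    R[s][a] is a deterministic reward in [0, 1].
    """
    rng = random.Random(seed)
    S, A = len(P), len(P[0])
    v_max = 1.0 / (1.0 - gamma)   # largest possible discounted return
    H = 1.0 / (1.0 - gamma)       # effective horizon (assumed choice)
    Q = [[v_max] * A for _ in range(S)]   # optimistic initialisation
    N = [[0] * A for _ in range(S)]       # visit counts per (state, action)
    s = 0
    for _ in range(steps):
        # Act greedily with respect to the optimistic estimate.
        a = max(range(A), key=lambda i: Q[s][i])
        s_next = rng.choices(range(S), weights=P[s][a])[0]
        N[s][a] += 1
        k = N[s][a]
        alpha = (H + 1.0) / (H + k)       # step size decays with the count
        # Exploration bonus shrinks like 1/sqrt(k) (assumed form and constants).
        bonus = c * v_max * math.sqrt(H * math.log(S * A * (k + 1) / delta) / k)
        target = R[s][a] + bonus + gamma * max(Q[s_next])
        # Clip so the estimate never exceeds the maximum possible value.
        Q[s][a] = min(v_max, (1.0 - alpha) * Q[s][a] + alpha * target)
        s = s_next
    return Q, N


if __name__ == "__main__":
    # Hypothetical 2-state, 2-action MDP: action a moves to state a,
    # and only staying in state 1 pays a reward.
    P = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
    R = [[0.0, 0.0], [0.0, 1.0]]
    Q, N = ucb_q_learning(P, R, gamma=0.5, steps=200)
    print(Q, N)
```

The agent never samples random actions: exploration comes entirely from the optimistic initial values and the bonus, which keeps rarely tried state-action pairs looking attractive until they have been visited often enough.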
Abstract
A fundamental question in reinforcement learning is whether model-free algorithms are sample efficient. Recently, Jin et al. \cite{jin2018q} proposed a Q-learning algorithm with UCB