BriefGPT.xyz
Apr, 2020
Tightening Exploration in Upper Confidence Reinforcement Learning
Hippolyte Bourel, Odalric-Ambrym Maillard, Mohammad Sadegh Talebi
TL;DR
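UCRL3 refines UCRL2 by introducing tailored time-uniform concentration inequalities and separate confidence intervals for the reward and transition distributions of each state-action pair, with the aim of reducing exploration. It improves on UCRL2 in theory, and numerical experiments in standard environments support its practicality and effectiveness.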
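To make the idea of a time-uniform confidence interval concrete, here is a minimal Python sketch. It assumes rewards bounded in [0, 1] (so sub-Gaussian with sigma = 1/2) and uses a standard Laplace-method bound, which holds simultaneously for all sample counts n. This is an illustrative choice and not necessarily the exact inequality used in the paper; the function names `time_uniform_width` and `reward_interval` are hypothetical.

```python
import math


def time_uniform_width(n, delta, sigma=0.5):
    """Half-width of a two-sided interval that holds for all n at once.

    Assumption: sigma-sub-Gaussian observations and a Laplace-method bound,
    sigma * sqrt(2 * (1 + 1/n) * ln(2 * sqrt(n + 1) / delta) / n).
    """
    if n < 1:
        return math.inf  # no observations yet, so no information
    log_term = math.log(2.0 * math.sqrt(n + 1) / delta)
    return sigma * math.sqrt(2.0 * (1.0 + 1.0 / n) * log_term / n)


def reward_interval(rewards, delta):
    """Confidence interval for the mean reward of one state-action pair."""
    n = len(rewards)
    if n == 0:
        return (0.0, 1.0)  # trivial interval for rewards in [0, 1]
    mean = sum(rewards) / n
    width = time_uniform_width(n, delta)
    # Clip to the known reward range [0, 1].
    return (max(0.0, mean - width), min(1.0, mean + width))
```

In an optimistic algorithm of this family, one such interval would be maintained per state-action pair, and the interval shrinks as that pair is visited more often, which is what limits further exploration of it.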
Abstract
The upper confidence reinforcement learning (UCRL2) strategy introduced in (Jaksch et al., 2010) is a popular method to perform regret minimization in unknown discrete Markov Decision Processes under the average-