BriefGPT.xyz
May, 2019
Variational Regret Bounds for Reinforcement Learning
Pratik Gajane, Ronald Ortner, Peter Auer
TL;DR
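This work proposes an algorithm for undiscounted reinforcement learning in Markov decision processes (MDPs) and provides performance guarantees with respect to the optimal non-stationary policy. It gives an upper bound on the regret in terms of the total variation of the MDP, which is the first variational regret bound for the general reinforcement learning setting.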
Abstract
We consider undiscounted reinforcement learning in Markov decision processes (MDPs) where both the reward functions and the state-transition probabilities may vary (gradually or abruptly) over time. For this prob