BriefGPT.xyz
Dec, 2019
Reinforcement Learning with Non-Markovian Rewards
Maor Gaon, Ronen I. Brafman
TL;DR
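Studies policy learning under non-Markovian rewards by combining Q-learning and R-max with automata learning algorithms, and proves that some of these variants converge to an optimal policy in the limit.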
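To make the idea concrete, here is a minimal sketch of Q-learning over a product of environment state and automaton state. It is an illustration under stated assumptions, not the authors' implementation: the 5-cell corridor, the reward rule, the two-state automaton, and all hyperparameters are hypothetical. The summarized approach learns the automaton from experience, whereas here it is written by hand to keep the example short. Exploration is uniformly random, which tabular Q-learning tolerates because it is off-policy.

```python
import random

# Hypothetical 1-D corridor with positions 0..4 (an assumption for illustration).
# Reward 1 is paid on reaching position 4 only if position 0 was visited earlier,
# so the reward depends on history, not just on the last state and action.
N = 5
ACTIONS = (-1, +1)

def step_env(pos, action):
    """Move left or right, clamped to the corridor."""
    return min(N - 1, max(0, pos + action))

def step_automaton(q, pos):
    """Hand-written automaton (assumed, not learned): q=1 once position 0 was seen."""
    return 1 if (q == 1 or pos == 0) else 0

def reward(q, pos):
    """Over the product state (pos, q) the reward is Markovian again."""
    return 1.0 if (q == 1 and pos == N - 1) else 0.0

def q_learning(episodes=2000, alpha=0.5, gamma=0.9, seed=0):
    """Tabular Q-learning on product states with a uniformly random behaviour policy."""
    rng = random.Random(seed)
    Q = {((p, q), a): 0.0 for p in range(N) for q in (0, 1) for a in ACTIONS}
    for _ in range(episodes):
        pos, q = 2, 0  # start in the middle, position 0 not yet visited
        for _ in range(50):
            s = (pos, q)
            a = rng.choice(ACTIONS)
            pos2 = step_env(pos, a)
            q2 = step_automaton(q, pos2)
            r = reward(q2, pos2)
            done = r > 0  # the episode ends when the reward is collected
            best_next = max(Q[((pos2, q2), b)] for b in ACTIONS)
            target = r if done else r + gamma * best_next
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            pos, q = pos2, q2
            if done:
                break
    return Q

def greedy_action(Q, pos, q):
    """Best action in product state (pos, q) under the learned Q-table."""
    return max(ACTIONS, key=lambda a: Q[((pos, q), a)])
```

Calling `q_learning()` and then `greedy_action(Q, 2, 0)` versus `greedy_action(Q, 2, 1)` shows the point of the construction: the same corridor position can call for different actions depending on the automaton state, which a policy over positions alone could not express.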
Abstract
The standard RL world model is that of a Markov Decision Process (MDP). A basic premise of MDPs is that the rewards depend on the last state and action only. Yet, many real-world rewards are non-Markovian. For ex