Reinforcement Learning with Non-Markovian Rewards

Dec, 2019

Maor Gaon, Ronen I. Brafman

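TL;DR: Studies policy learning under non-Markovian rewards by combining Q-learning and R-max with automaton-learning algorithms, and proves that some of these variants converge to the optimal policy in the limit.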
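To make the idea concrete, here is a minimal sketch of tabular Q-learning over a product state (environment state, automaton state). It is not the paper's algorithm: the paper learns the automaton from experience, whereas this sketch assumes a hand-written two-state automaton. The corridor environment, the reward rule, and every name in the code (`automaton_step`, `env_step`, `train`, `greedy`) are hypothetical and exist only for illustration.

```python
import random

# Hypothetical toy task (an assumption, not from the paper): a corridor with
# cells 0, 1, 2. The agent starts in cell 1. Reaching cell 2 ends the episode
# and pays 1 only if cell 0 was visited earlier, so the reward depends on
# history and is non-Markovian in the position alone.
N_CELLS = 3
ACTIONS = (-1, 1)  # move left or move right


def automaton_step(q, pos):
    """Hand-written automaton: q switches to 1 once cell 0 has been visited."""
    return 1 if (q == 1 or pos == 0) else 0


def env_step(pos, q, action):
    """Return (next_pos, next_q, reward, done) for the toy task."""
    next_pos = min(max(pos + action, 0), N_CELLS - 1)
    next_q = automaton_step(q, next_pos)
    done = next_pos == N_CELLS - 1
    # The automaton state summarizes the relevant history for the reward.
    reward = 1.0 if (done and next_q == 1) else 0.0
    return next_pos, next_q, reward, done


def train(episodes=2000, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning where each state is the pair (position, automaton state)."""
    rng = random.Random(seed)
    Q = {((p, q), a): 0.0
         for p in range(N_CELLS) for q in (0, 1) for a in ACTIONS}
    for _ in range(episodes):
        pos, q = 1, 0
        for _ in range(50):  # cap on episode length
            s = (pos, q)
            if rng.random() < epsilon:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda act: Q[(s, act)])
            pos, q, r, done = env_step(pos, q, a)
            best_next = 0.0 if done else max(Q[((pos, q), b)] for b in ACTIONS)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            if done:
                break
    return Q


def greedy(Q, state):
    """Greedy action for a product state."""
    return max(ACTIONS, key=lambda act: Q[(state, act)])


if __name__ == "__main__":
    Q = train()
    for state in [(1, 0), (0, 1), (1, 1)]:
        print(state, "->", greedy(Q, state))
```

Over the product state the reward becomes Markovian, so ordinary Q-learning applies: from `(1, 0)` the agent should learn to detour left through cell 0 before heading to cell 2, which a table indexed by position alone cannot represent.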

Abstract

The standard RL world model is that of a Markov Decision Process (MDP). A basic premise of MDPs is that the rewards depend on the last state and action only. Yet, many real-world rewards are non-Markovian. For ex