BriefGPT.xyz
Nov, 2012
On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes
Bruno Scherrer, Boris Lesner
TL;DR
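The paper proposes algorithms for computing near-optimal non-stationary policies for infinite-horizon Markov decision processes (MDPs). It develops variants of value iteration and policy iteration and guarantees that the computed policy, whether stationary or non-stationary, differs from the true optimal policy by at most a bounded precision.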
Abstract
We consider infinite-horizon stationary $\gamma$-discounted Markov decision processes, for which it is known that there exists a stationary optimal policy. Using Value and Policy Iteration
with some error $\epsil