BriefGPT.xyz
Dec, 2017
On Convergence of some Gradient-based Temporal-Differences Algorithms for Off-Policy Learning
Huizhen Yu
TL;DR
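This paper considers off-policy temporal-difference (TD) learning methods for policy evaluation in Markov decision processes with finite state spaces and discounted reward criteria, and presents a collection of convergence results for several gradient-based TD algorithms.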
Abstract
We consider off-policy temporal-difference (TD) learning methods for policy evaluation in Markov decision processes with finite spaces and discounted reward criteria, and we present a collection of convergence results …