BriefGPT.xyz
May, 2018
Approximate Temporal Difference Learning is a Gradient Descent for Reversible Policies
Yann Ollivier
TL;DR
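This paper explores how, in reinforcement learning, using the Dirichlet norm in place of the standard error measure guarantees convergence of the TD algorithm, even with nonlinear parametric approximation, and resolves the vanishing-gradient problem.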
Abstract
In reinforcement learning, temporal difference (TD) is the most direct algorithm to learn the value function of a policy. For large or inf