May 2019
Neural Temporal-Difference Learning Converges to Global Optima
Qi Cai, Zhuoran Yang, Jason D. Lee, Zhaoran Wang
TL;DR
By using overparameterization to handle the nonconvexity of neural TD's optimization, the paper proves that neural TD converges at a sublinear rate to the global optimum of the mean-squared Bellman error in policy evaluation, and further connects this result to the global convergence of policy-gradient algorithms.
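For reference, the policy-evaluation objective and the TD update can be written as follows; this is a conventional formulation with assumed notation, not taken from this page:

$$
\mathrm{MSBE}(\theta) = \mathbb{E}_{s \sim \mu}\Big[\big(V_\theta(s) - \mathbb{E}\big[r(s,a) + \gamma\, V_\theta(s') \,\big|\, s,\ a \sim \pi\big]\big)^2\Big],
$$
$$
\theta_{t+1} = \theta_t + \alpha\,\big(r_t + \gamma\, V_{\theta_t}(s_{t+1}) - V_{\theta_t}(s_t)\big)\,\nabla_\theta V_{\theta_t}(s_t).
$$

Note that the update is a semi-gradient: it does not follow the true gradient of the MSBE, which is one reason convergence is nontrivial with nonlinear function approximators such as neural networks.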
Abstract
Temporal-difference learning (TD), coupled with neural networks, is among the most fundamental building blocks of deep reinforcement learning. However, due to the nonlinearity in value function approximation, such a coupling makes the optimization nonconvex, and the global convergence of neural TD has remained unclear.
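As a rough illustration of the algorithm being analyzed, here is a minimal NumPy sketch of neural TD(0) with a two-layer overparameterized ReLU network whose output weights are fixed at ±1 and only the first layer is trained, matching the regime the paper studies; the environment dynamics, dimensions, and step size are hypothetical stand-ins, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 4, 512             # state dimension; network width (overparameterized)
gamma, alpha = 0.9, 0.05  # discount factor; step size (illustrative values)

# Two-layer ReLU network V(s) = (1/sqrt(m)) * sum_k b_k * relu(w_k . s),
# with output weights b_k fixed at +/-1 and only W trained.
W = rng.normal(size=(m, d))
b = rng.choice([-1.0, 1.0], size=m)

def value(s):
    return b @ np.maximum(W @ s, 0.0) / np.sqrt(m)

def grad_value(s):
    # dV/dW_k = (1/sqrt(m)) * b_k * 1{w_k . s > 0} * s
    active = (W @ s > 0.0).astype(float)
    return (b * active)[:, None] * s[None, :] / np.sqrt(m)

def env_step(s):
    # Hypothetical stand-in dynamics: contracting linear drift plus noise;
    # the reward is the first coordinate of the state.
    s_next = np.clip(0.9 * s + 0.1 * rng.normal(size=d), -1.0, 1.0)
    return s_next, s[0]

s = rng.uniform(-1, 1, size=d)
for t in range(10_000):
    s_next, r = env_step(s)
    td_error = r + gamma * value(s_next) - value(s)  # delta_t
    W += alpha * td_error * grad_value(s)            # semi-gradient step
    s = s_next

print(f"final value estimate at current state: {value(s):.3f}")
```

The fixed-output-weight, width-m parameterization is what makes the overparameterized (large m) analysis tractable: near initialization the network behaves almost linearly in W, which is the mechanism behind the paper's global-convergence argument.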