Neural Temporal-Difference Learning and Q-Learning Provably Converge to Global Optima

May, 2019

Neural Temporal-Difference Learning Converges to Global Optima

Qi Cai, Zhuoran Yang, Jason D. Lee, Zhaoran Wang

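TL;DR: The paper uses overparameterization to handle the nonlinearity that makes neural TD hard to optimize. It proves that, for policy evaluation, neural TD converges at a sublinear rate to the global optimum of the mean-squared projected Bellman error, and it further connects this result to the global convergence of policy gradient algorithms.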
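To make the setting concrete, here is a minimal sketch of semi-gradient TD(0) for policy evaluation with a wide (overparameterized) two-layer ReLU network. It is an illustration under stated assumptions, not the paper's exact algorithm: the network scaling, the fixed random output signs, the toy ring random walk, and every name and hyperparameter (`init_network`, `td0_step`, `run_demo`, `m`, `lr`) are hypothetical choices made for this example.

```python
import numpy as np

def init_network(d, m, rng):
    """Wide two-layer ReLU net: trainable W (m x d), fixed random output signs b."""
    W = rng.normal(0.0, 1.0 / np.sqrt(d), size=(m, d))
    b = rng.choice([-1.0, 1.0], size=m)
    return W, b

def value(W, b, x):
    """V(x) = (1 / sqrt(m)) * sum_r b_r * relu(W_r . x)."""
    return float(b @ np.maximum(W @ x, 0.0)) / np.sqrt(W.shape[0])

def grad_W(W, b, x):
    """Gradient of value(W, b, x) with respect to W, shape (m, d)."""
    active = (W @ x > 0.0).astype(float)  # ReLU derivative per hidden unit
    return (b * active)[:, None] * x[None, :] / np.sqrt(W.shape[0])

def td0_step(W, b, x, r, x_next, gamma, lr):
    """One semi-gradient TD(0) update; the bootstrapped target is held constant."""
    delta = value(W, b, x) - (r + gamma * value(W, b, x_next))
    return W - lr * delta * grad_W(W, b, x), delta

def run_demo(m=1024, steps=5000, gamma=0.9, lr=0.1, seed=0):
    """Assumed toy task: 5-state ring random walk, reward 1 on entering state 0.

    Returns the mean squared sampled TD error over the first and last 500 steps.
    """
    rng = np.random.default_rng(seed)
    n = 5
    features = np.eye(n)  # one-hot state features
    W, b = init_network(n, m, rng)
    s = 0
    sq_errors = []
    for _ in range(steps):
        s_next = (s + rng.choice([-1, 1])) % n
        r = 1.0 if s_next == 0 else 0.0
        W, delta = td0_step(W, b, features[s], r, features[s_next], gamma, lr)
        sq_errors.append(delta ** 2)
        s = s_next
    return float(np.mean(sq_errors[:500])), float(np.mean(sq_errors[-500:]))
```

Calling `run_demo()` returns the early and late averages of the squared sampled TD error. Because transitions are stochastic, the sampled error does not go to zero even when the value estimate is accurate, so it is only a rough proxy for the objective named in the TL;DR.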

Abstract

Temporal-difference learning (TD), coupled with neural networks, is among the most fundamental building blocks of deep reinforcement learning. However, due to the nonlinearity in value function approximation, suc