Backstepping Temporal Difference Learning

Feb 2023

Han-Dong Lim, Donghwan Lee

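TL;DR: This paper offers a unified view, from a purely control-theoretic perspective, of TD-learning algorithms that correct off-policy errors (including GTD and TDC). It proposes a new convergent algorithm based on the backstepping technique and experimentally verifies its convergence in environments where standard TD-learning is unstable.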

Abstract

Off-policy learning ability is an important feature of reinforcement learning (RL) for practical applications. However, even one of the most elementary RL algorithms, temporal-difference (TD) learning, is known t