A Model-free Learning Algorithm for Infinite-horizon Average-reward MDPs with Near-optimal Regret

June 2020

A Model-free Learning Algorithm for Infinite-horizon Average-reward MDPs with Near-optimal Regret

Mehdi Jafarnia-Jahromi, Chen-Yu Wei, Rahul Jain, Haipeng Luo

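TL;DR: Proposes EE-QL, a model-free learning algorithm for weakly communicating MDPs that uses a concentrating approximation and learns about as fast as the best known model-based algorithms.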

Abstract

Recently, model-free reinforcement learning has attracted research attention due to its simplicity, memory and computation efficiency, and the flexibility to combine with function approximation. In this paper, we propose →