BriefGPT.xyz
Sep, 2024
Provably Efficient Infinite-Horizon Average-Reward Reinforcement Learning with Linear Function Approximation
Woojin Chae, Dabeen Lee
TL;DR
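This paper proposes a computationally tractable algorithm for learning infinite-horizon average-reward linear Markov decision processes (MDPs) and linear mixture MDPs under the Bellman optimality condition. While remaining computationally efficient, the algorithm achieves the best-known regret bound for linear MDPs, a result of notable theoretical and practical significance.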
Abstract
This paper proposes a computationally tractable algorithm for learning infinite-horizon average-reward linear Markov Decision Processes (MDPs) and linear mixture MDPs under the Bellman Optimality Condition. While …