BriefGPT.xyz
May, 2024
Provably Efficient Reinforcement Learning for Infinite-Horizon Average-Reward Linear MDPs
Kihyuk Hong, Yufan Zhang, Ambuj Tewari
TL;DR
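The authors design a computationally efficient algorithm that achieves O(sqrt(T)) regret for infinite-horizon average-reward linear Markov decision processes (MDPs). It approximates the average-reward setting by the discounted setting and runs an optimistic value iteration-based algorithm with a suitably tuned discount factor.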
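The discounted approximation behind this summary can be illustrated on a toy problem. The sketch below is not the paper's algorithm: it is tabular, uses a known model, and has no optimism bonus or linear features. It only shows that (1 - gamma) times the optimal discounted value approaches the optimal average reward as gamma approaches 1. The two-state MDP, the names `P`, `R`, `OPTIMAL_GAIN`, `discounted_value_iteration`, and `scaled_values`, and the iteration budget are all assumptions made for illustration.

```python
# Toy tabular MDP (hypothetical numbers, not taken from the paper).
# P[s][a] lists next-state probabilities; R[s][a] is a reward in [0, 1].
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # state 0: action 0 stays, action 1 moves to state 1
    [[0.1, 0.9], [1.0, 0.0]],  # state 1: action 0 mostly stays, action 1 moves to state 0
]
R = [
    [0.2, 0.0],
    [1.0, 0.0],
]

# Optimal long-run average reward, worked out by hand for this MDP:
# play action 1 in state 0 and action 0 in state 1. The stationary
# distribution of that policy is (1/11, 10/11), so the gain is 10/11.
OPTIMAL_GAIN = 10.0 / 11.0


def discounted_value_iteration(gamma, iters):
    """Return the (approximately) optimal discounted values V_gamma."""
    V = [0.0] * len(P)
    for _ in range(iters):
        # Bellman optimality backup for every state.
        V = [
            max(
                R[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))
                for a in range(len(P[s]))
            )
            for s in range(len(P))
        ]
    return V


def scaled_values(gamma):
    """Return (1 - gamma) * V_gamma(s) for every state s."""
    # Assumed budget: enough sweeps for gamma**iters to be negligible.
    iters = int(20 / (1.0 - gamma))
    return [(1.0 - gamma) * v for v in discounted_value_iteration(gamma, iters)]


if __name__ == "__main__":
    for gamma in (0.9, 0.99, 0.999):
        gaps = [abs(g - OPTIMAL_GAIN) for g in scaled_values(gamma)]
        print(gamma, max(gaps))
```

Running the script prints, for each discount factor, the largest gap between the scaled discounted value and the average-reward optimum. In this toy example the gap shrinks as gamma moves toward 1, which is the sense in which the discounted problem stands in for the average-reward one.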
Abstract
We resolve the open problem of designing a computationally efficient algorithm for infinite-horizon average-reward linear Markov decision processes (MDPs) with $\widetilde{O}(\sqrt{T})$ …