BriefGPT.xyz
May, 2023
Delay-Adapted Policy Optimization and Improved Regret for Adversarial MDP with Delayed Bandit Feedback
Tal Lancewicki, Aviv Rosenberg, Dmitry Sotnikov
TL;DR
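This work studies policy optimization (PO) in adversarial MDPs with delayed feedback. It proposes the Delay-Adapted PO algorithm and obtains new regret bounds for tabular MDPs, and it reports notable results both for infinite state spaces with linear Q-functions and for deep RL applications.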
Abstract
Policy optimization (PO) is one of the most popular methods in reinforcement learning (RL). Thus, theoretical guarantees for PO algorithms have become especially important to the RL community. In this paper, we s…