Delay-Adapted Policy Optimization and Improved Regret for Adversarial MDP with Delayed Bandit Feedback

May, 2023

Delay-Adapted Policy Optimization and Improved Regret for Adversarial MDP with Delayed Bandit Feedback

Tal Lancewicki, Aviv Rosenberg, Dmitry Sotnikov

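TL;DR: This work studies policy optimization (PO) in adversarial MDPs with delayed bandit feedback. It proposes the Delay-Adapted PO algorithm and obtains new regret bounds for tabular MDPs, and it extends the approach to infinite state spaces with linear Q-functions and to deep RL applications, with favorable results reported in both.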

Abstract

Policy optimization (PO) is one of the most popular methods in reinforcement learning (RL). Thus, theoretical guarantees for PO algorithms have become especially important to the RL community. In this paper, we s