BriefGPT.xyz
Jun, 2020
Dynamic Regret of Policy Optimization in Non-stationary Environments
Yingjie Fei, Zhuoran Yang, Zhaoran Wang, Qiaomin Xie
TL;DR
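This paper proposes two model-free policy optimization algorithms, POWER and POWER++, for episodic MDPs with adversarial full-information reward feedback and unknown fixed transition kernels, and establishes dynamic regret guarantees for both.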
Abstract
We consider reinforcement learning (RL) in episodic MDPs with adversarial full-information reward feedback and unknown fixed transition kernels. We propose two model-free policy optimization algorithms, POWER and
→