BriefGPT.xyz
Feb, 2020
Learning in Markov Decision Processes under Constraints
Rahul Singh, Abhishek Gupta, Ness B. Shroff
TL;DR
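This paper studies how to design model-based reinforcement learning algorithms that maximize cumulative reward subject to average-cost constraints, so that the average of each cost stays below a specified upper bound. It also proposes a way to measure the performance of a reinforcement learning algorithm: an $(M+1)$-dimensional regret vector that captures the gaps in the reward and in each of the costs. The expected value of the regret vector of the UCRL-CMDP algorithm is shown to be upper-bounded by $O(T^{2/3})$.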
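To make the $(M+1)$-dimensional regret vector concrete, here is a minimal sketch of how such a vector could be computed from one trajectory. The exact definition is not given on this page, so the form used here is an assumption: the reward component is the shortfall against an optimal average reward over $T$ steps, and each cost component is the amount by which the cumulative cost exceeds $T$ times its bound. The function name `regret_vector` and all of its parameters are hypothetical.

```python
from typing import List, Sequence


def regret_vector(
    rewards: Sequence[float],
    costs: Sequence[Sequence[float]],
    optimal_avg_reward: float,
    cost_bounds: Sequence[float],
) -> List[float]:
    """Return [reward regret, cost regret 1, ..., cost regret M].

    Assumed definitions (not taken from the page):
      reward regret  = T * optimal_avg_reward - sum of earned rewards
      cost regret i  = sum of incurred cost i - T * cost_bounds[i]
    """
    T = len(rewards)
    reward_regret = T * optimal_avg_reward - sum(rewards)
    cost_regrets = [
        sum(step[i] for step in costs) - T * bound
        for i, bound in enumerate(cost_bounds)
    ]
    return [reward_regret] + cost_regrets


# Toy trajectory: T = 4 steps, M = 2 costs (made-up numbers for illustration).
rewards = [1.0, 0.5, 1.0, 0.5]
costs = [[0.25, 0.0], [0.5, 0.25], [0.25, 0.0], [0.5, 0.25]]
vec = regret_vector(rewards, costs, optimal_avg_reward=1.0, cost_bounds=[0.25, 0.25])
print(vec)
```

Under these assumed definitions, a positive cost component means the corresponding constraint was violated on average over the trajectory, while a negative one means the agent stayed under the bound.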
Abstract
We consider reinforcement learning (RL) in Markov decision processes (MDPs) in which at each time step the agent, in addition to earning a reward, also incurs an $M$-dimensional vector of costs. The objective is