BriefGPT.xyz
Sep, 2020
Improved Exploration in Factored Average-Reward MDPs
Mohammad Sadegh Talebi, Anders Jonsson, Odalric-Ambrym Maillard
TL;DR
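This paper studies regret minimization under the average-reward criterion in an unknown factored Markov decision process (FMDP). It proposes a new regret minimization strategy, DBN-UCRL, which relies on Bernstein-type confidence intervals defined for individual elements of the transition function, and reports numerical experiments on standard environments.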
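To make the idea of a Bernstein-type confidence interval on a single transition probability concrete, here is a minimal Python sketch. It is not the paper's confidence set: the function name `bernstein_interval`, the `log(3 / delta)` term, and the constants 2 and 3 are assumptions borrowed from a common empirical-Bernstein form for variables bounded in [0, 1], chosen only for illustration.

```python
import math


def bernstein_interval(successes, n, delta):
    """Illustrative Bernstein-style interval for one transition probability.

    successes: number of times the transition was observed (hypothetical counter)
    n:         number of visits to the corresponding state-action pair
    delta:     failure probability of the interval
    Constants are an assumed, generic empirical-Bernstein form, not the paper's.
    """
    if n == 0:
        # No data yet: the interval is the whole probability range.
        return 0.0, 1.0
    p_hat = successes / n
    log_term = math.log(3.0 / delta)
    # Empirical variance of a Bernoulli sample with mean p_hat.
    var_hat = p_hat * (1.0 - p_hat)
    # Variance-dependent term plus a lower-order 1/n correction.
    width = math.sqrt(2.0 * var_hat * log_term / n) + 3.0 * log_term / n
    return max(0.0, p_hat - width), min(1.0, p_hat + width)


if __name__ == "__main__":
    for successes, n in [(50, 100), (2, 100), (20, 1000)]:
        low, high = bernstein_interval(successes, n, delta=0.05)
        print(f"{successes}/{n}: [{low:.3f}, {high:.3f}]")
```

Under these assumptions the interval is tighter when the estimated probability is close to 0 or 1, because the variance term shrinks. That is the usual motivation for preferring variance-aware bounds over Hoeffding-style ones for sparse transition functions.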
Abstract
We consider a regret minimization task under the average-reward criterion in an unknown Factored Markov Decision Process (FMDP). More specifically, we consider an FMDP where the state-action space $\mathcal X$ an