BriefGPT.xyz
Sep, 2012
关于不安分马尔科夫赌博机的遗憾界限
Regret Bounds for Restless Markov Bandits
HTML
PDF
Ronald Ortner, Daniil Ryabko, Peter Auer, Rémi Munos
TL;DR
本文介绍了一种算法来解决不安分的马尔科夫赌臂问题,并证明了基于指数的策略在这个问题中一定是次优的。该算法可以在不需要假设马尔可夫链除了不可约的任何情况下,经过T步后实现相对于知道所有赌臂分布的最佳策略的 O(√T) 的悔恨。
Abstract
We consider the
restless markov bandit problem
, in which the state of each arm evolves according to a Markov process independently of the learner's actions. We suggest an
algorithm
that after $T$ steps achieves $
→