We consider the classical multi-armed bandit problem with Markovian rewards. When played an arm changes its state in a Markovian fashion while it remains frozen when not played. The player receives a state-dependent reward each time it plays an arm. The number of states and the state transition probabilities of an arm are unknown to the player. The player's objective is to maximize its long-term total reward by learning the best arm over time. We show that under certain conditions on the state transition probabilities of the arms, a sample mean based index policy achieves logarithmic regret uniformly over the total number of trials. The result shows that sample mean based index policies can be applied to learning problems under the rested Markovian bandit model without loss of optimality in the order. Moreover, comparision between Anantharam's index policy and UCB shows that by choosing a small exploration parameter UCB can have a smaller regret than Anantharam's index policy.

考虑带Markov奖励的经典多臂赌博机问题，玩一只手臂时，其状态会按Markov方式更改，不玩时保持冻结。玩一只手臂时，玩家会获得与状态相关的奖励，每只手臂的状态转移概率未知。我们证明在手臂的状态转移概率满足一定条件下，基于样本均值的指数策略能够在总试验次数上实现对数遗憾，同时也证明了在具有休息的Markov赌博机模型下，样本均值指数策略不会降低最优性。此外，对比Anantharam的指数策略和UCB，我们发现通过选择一个小的探索参数UCB可以比Anantharam的指数策略拥有更小的遗憾。

具有马尔可夫回报的多臂赌博机问题的在线算法