Prior-free and prior-dependent regret bounds for Thompson Sampling

Apr, 2013

A note on the Bayesian regret of Thompson Sampling with an arbitrary prior

Sébastien Bubeck, Che-Yu Liu

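TL;DR: This paper studies the stochastic multi-armed bandit problem with a prior distribution on the reward distributions. It proves that, for any prior distribution, Thompson Sampling attains an optimal upper bound on the Bayesian regret, and, for priors in the setting of Bubeck et al., it establishes a uniform bound on the algorithm's regret. The analysis is related to the techniques of Audibert and Bubeck [2009] and Russo and Van Roy [2013].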
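For readers unfamiliar with the algorithm, the following is a minimal sketch of Thompson Sampling, not the paper's own setup. It assumes Bernoulli rewards and independent uniform Beta(1, 1) priors on each arm, a common textbook instance chosen here only for illustration. The function name `thompson_sampling` and its parameters are hypothetical. The quantity it returns is the pseudo-regret on one fixed problem instance; the Bayesian regret discussed in the paper would additionally average over instances drawn from the prior.

```python
import random


def thompson_sampling(true_means, horizon, seed=0):
    """Run Beta-Bernoulli Thompson Sampling; return (pulls, pseudo_regret).

    Assumption (not from the paper): rewards are Bernoulli and each arm
    starts with a uniform Beta(1, 1) prior on its mean.
    """
    rng = random.Random(seed)
    k = len(true_means)
    successes = [0] * k  # number of observed rewards equal to 1, per arm
    failures = [0] * k   # number of observed rewards equal to 0, per arm
    pulls = [0] * k
    best_mean = max(true_means)
    regret = 0.0
    for _ in range(horizon):
        # Draw one sample per arm from its current Beta posterior.
        samples = [
            rng.betavariate(successes[i] + 1, failures[i] + 1)
            for i in range(k)
        ]
        # Play the arm whose sampled mean is largest.
        arm = max(range(k), key=lambda i: samples[i])
        reward = 1 if rng.random() < true_means[arm] else 0
        # Update the posterior counts of the played arm.
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
        # Accumulate the gap between the best arm and the played arm.
        regret += best_mean - true_means[arm]
    return pulls, regret


if __name__ == "__main__":
    pulls, regret = thompson_sampling([0.2, 0.5, 0.8], horizon=2000, seed=1)
    print("pulls per arm:", pulls)
    print("pseudo-regret:", regret)
```

Each round, the posterior sample plays the role of a plausible guess for every arm's mean, so arms that are either promising or still uncertain get explored, while clearly inferior arms are played less and less often.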

Abstract

We consider the stochastic multi-armed bandit problem with a prior distribution on the reward distributions. We show that for any prior distribut