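TL;DR: For Bernoulli rewards, this paper gives the first finite-time analysis whose cumulative regret rate matches the Lai and Robbins lower bound, showing that Thompson Sampling is an asymptotically optimal strategy for the stochastic multi-armed bandit problem. Numerical comparisons and experiments support this conclusion.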
Abstract
The question of the optimality of Thompson Sampling for solving the stochastic multi-armed bandit problem had been open since 1933. In this paper we answer it positively for the case of →