This paper proposes a novel upper confidence bound (UCB) procedure for identifying the arm with the largest mean in a multi-armed bandit game in the fixed confidence setting using a small number of total samples. The procedure cannot be improved in the sense that the number of samples required to identify the best arm is within a constant factor of a lower bound based on the law of the iterated logarithm (LIL). Inspired by the LIL, we construct our confidence bounds to explicitly account for the infinite time horizon of the algorithm. In addition, by using a novel stopping time for the algorithm, we avoid the union bound over the arms that appears in other UCB-type algorithms. We prove that the algorithm is optimal up to constants and show through simulations that it outperforms the state of the art.

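For the multi-armed bandit game in the fixed confidence setting, this paper proposes a novel upper confidence bound algorithm that relies on confidence intervals and uses only a small number of samples. The number of samples the algorithm uses is within a constant factor of a lower bound based on the law of the iterated logarithm. A new stopping time avoids the union bound over the arms seen in other UCB-type algorithms, which further improves the algorithm, and simulations demonstrate its performance.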
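
The ideas summarized above (an LIL-shaped confidence bonus and a stopping time based on pull counts rather than a union bound over arms) can be illustrated with a rough Python sketch. The abstract gives no formulas, so everything concrete here is an assumption of the sketch: the exact form of the bonus in `lil_bonus`, the `+ 2` inside the logarithm (added only to keep it positive for small pull counts), the stopping rule in `best_arm`, and the parameter values `eps`, `sigma`, `beta`, `lam` and `max_pulls`. The function names are hypothetical as well.

```python
import math
import random


def lil_bonus(pulls, delta, eps=0.01, sigma=0.5):
    """LIL-shaped confidence width after `pulls` samples of one arm (assumed form)."""
    # The "+ 2" keeps the inner logarithm above 1 for small pull counts;
    # it is a convenience of this sketch, not something stated in the abstract.
    inner = math.log((1 + eps) * pulls + 2) / delta
    width = math.sqrt(2 * sigma**2 * (1 + eps) * math.log(inner) / pulls)
    return (1 + math.sqrt(eps)) * width


def best_arm(sample, n_arms, delta=0.05, beta=1.0, lam=9.0, max_pulls=200_000):
    """Return (arm, total_pulls); `sample(i)` draws one reward from arm i."""
    sums = [sample(i) for i in range(n_arms)]  # pull every arm once
    counts = [1] * n_arms
    total = n_arms

    def index(i):
        # Empirical mean plus an inflated LIL-style bonus.
        return sums[i] / counts[i] + (1 + beta) * lil_bonus(counts[i], delta)

    while total < max_pulls:  # safety cap, an assumption of this sketch
        leader = max(range(n_arms), key=counts.__getitem__)
        # Assumed stopping rule: one arm has been pulled far more often
        # than all the other arms combined.
        if counts[leader] >= 1 + lam * (total - counts[leader]):
            break
        arm = max(range(n_arms), key=index)
        sums[arm] += sample(arm)
        counts[arm] += 1
        total += 1

    # Report the most frequently pulled arm as the estimated best arm.
    return max(range(n_arms), key=counts.__getitem__), total
```

As a usage example, with `rng = random.Random(0)` and Bernoulli arms of means 0.9, 0.5 and 0.3, calling `best_arm(lambda i: float(rng.random() < means[i]), 3)` returns the index of the selected arm together with the number of samples spent. Only the leading arm needs to be checked in the stopping rule, because any arm that satisfies the condition also has the largest pull count.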

lil' UCB: An Optimal Exploration Algorithm for Multi-Armed Bandits