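TL;DR: This paper proposes Thompson sampling algorithms for the mean-variance MAB problem and provides a comprehensive regret analysis for Gaussian and Bernoulli bandits under fewer assumptions. Our algorithms achieve the best known regret bounds across all parameter regimes.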
Abstract
The multi-armed bandit (MAB) problem is a classical learning task that
exemplifies the exploration-exploitation tradeoff. However, standard
formulations do not take into account risk. In online decision mak