Thompson Sampling Algorithms for Mean-Variance Bandits

Feb, 2020

Thompson Sampling Algorithms for Mean-Variance Bandits

Qiuyu Zhu, Vincent Y. F. Tan

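TL;DR: This paper proposes Thompson sampling algorithms for the mean-variance multi-armed bandit (MAB) problem and provides comprehensive regret analyses for Gaussian and Bernoulli bandits under fewer assumptions. The algorithms achieve the best known regret bounds across a range of parameter settings.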

Abstract

The multi-armed bandit (MAB) problem is a classical learning task that exemplifies the exploration-exploitation tradeoff. However, standard formulations do not take into account risk. In online decision making sy