TL;DR本文提出了一种新的算法 Discounted Thompson Sampling (DS-TS) with Gaussian priors,用于解决非平稳多臂赌博机问题,并分析了算法在不同情况下的表现和 upper bound of regret。
Abstract
non-stationary multi-armed bandit (NS-MAB) problems have recently received significant attention. NS-MAB are typically modelled in two scenarios: abruptly changing, where reward distributions remain constant for