We consider the setup of stochastic multi-armed bandits in the case when reward distributions are piecewise i.i.d. and bounded with unknown changepoints. We focus on the case when changes happen simultaneously on all arms, and in stark contrast with the existing literature, we target gap-dependent (as opposed to only gap-independent) regret bounds involving the magnitude of changes $(\Delta^{chg}_{i,g})$ and optimality-gaps ($\Delta^{opt}_{i,g}$). Diverging from previous works, we assume the more realistic scenario that there can be undetectable changepoint gaps and under a different set of assumptions, we show that as long as the compounded delayed detection for each changepoint is bounded there is no need for forced exploration to actively detect changepoints. We introduce two adaptations of UCB-strategies that employ scan-statistics in order to actively detect the changepoints, without knowing in advance the changepoints and also the mean before and after any change. Our first method \UCBLCPD does not know the number of changepoints $G$ or time horizon $T$ and achieves the first time-uniform concentration bound for this setting using the Laplace method of integration. The second strategy \ImpCPD makes use of the knowledge of $T$ to achieve the order optimal regret bound of $\min\big\lbrace O(\sum\limits_{i=1}^{K} \sum\limits_{g=1}^{G}\frac{\log(T/H_{1,g})}{\Delta^{opt}_{i,g}}), O(\sqrt{GT})\big\rbrace$, (where $H_{1,g}$ is the problem complexity) thereby closing an important gap with respect to the lower bound in a specific challenging setting. Our theoretical findings are supported by numerical experiments on synthetic and real-life datasets.

本文研究了随机多武器歹徒问题的设置，在未知变化点的情况下，将奖励分配为分段独立同分布且有界。我们集中研究了所有武器同时发生更改的情况，并针对涉及变化量（∆{^{chg}_{i,g}}）和最优间隙（∆{^{opt}_{i,g}}）的依赖间隙（而不仅是间隙独立的间隙）后悔边界。在不知道变化点的情况下，我们介绍了两种UCB策略的自适应性，并采用扫描统计技术，以积极检测变化点。

分布相关和时间均匀的分段 i.i.d. 摇臂界