We investigate multiarmed bandits with delayed feedback, where the delays need neither be identical nor bounded. We first prove that the "delayed" Exp3 achieves the $O(\sqrt{(KT + D)\ln K})$ regret bound conjectured by Cesa-Bianchi et al. [2016], in the case of variable, but bounded delays. Here, $K$ is the number of actions and $D$ is the total delay over $T$ rounds. We then introduce a new algorithm that lifts the requirement of bounded delays by using a wrapper that skips rounds with excessively large delays. The new algorithm maintains the same regret bound, but similar to its predecessor requires prior knowledge of $D$ and $T$. For this algorithm we then construct a novel doubling scheme that forgoes this requirement under the assumption that the delays are available at action time (rather than at loss observation time). This assumption is satisfied in a broad range of applications, including interaction with servers and service providers. The resulting oracle regret bound is of order $\min_{\beta} (|S_\beta|+\beta \ln K + (KT + D_\beta)/\beta)$, where $|S_\beta|$ is the number of observations with delay exceeding $\beta$, and $D_\beta$ is the total delay of observations with delay below $\beta$. The bound relaxes to $O(\sqrt{(KT + D)\ln K})$, but we also provide examples where $D_\beta \ll D$ and the oracle bound has a polynomially better dependence on the problem parameters.

本文研究带有延迟反馈的多臂老虎机问题，证明了先前的算法在延迟是变量但有上界的情况下具有较好的表现，提出了一种新算法通过一个跳过具有过度大延迟的步骤的 wrapper 来降低了对上界的要求，同时构造了一种新的加倍方案，从而放宽了对时间和延迟知识的要求。提出的算法解决了丰富的应用场景问题并达到了合理的预期表现。

具有无限制延迟的非随机多臂赌博机