In this study, we propose a new method for constructing UCB-type algorithms for stochastic multi-armed bandits based on general convex optimization methods with an inexact oracle. We derive regret bounds that correspond to the convergence rates of the optimization methods. We propose a new algorithm, Clipped-SGD-UCB, and show, both theoretically and empirically, that when the noise in the reward is symmetric, we can achieve an $O(\log T\sqrt{KT\log T})$ regret bound instead of the $O\left(T^{\frac{1}{1+\alpha}} K^{\frac{\alpha}{1+\alpha}}\right)$ bound that applies when the reward distribution satisfies $\mathbb{E}_{X \in D}[|X|^{1+\alpha}] \leq \sigma^{1+\alpha}$ ($\alpha \in (0, 1]$); that is, the algorithm performs better than the general lower bound for bandits with heavy tails would suggest. Moreover, the same bound holds even when the reward distribution has no expectation, that is, when $\alpha<0$.
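To illustrate why symmetric noise helps, the following minimal sketch estimates the center of a Cauchy distribution (symmetric, with no finite mean) by running SGD with clipped gradients on $f(x) = \frac{1}{2}\mathbb{E}[(x - X)^2]$. It is not the paper's Clipped-SGD-UCB algorithm: the function names, the clipping level, the $2/t$ step size, and the choice of Cauchy noise are all assumptions made for this illustration only.

```python
import math
import random


def cauchy_samples(center, n, seed=0):
    """Draw n samples from a Cauchy distribution (symmetric, no finite mean)."""
    rng = random.Random(seed)
    # Inverse-CDF sampling: center + tan(pi * (U - 1/2)) with U uniform on [0, 1).
    return [center + math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]


def clipped_sgd_mean(samples, clip_level=1.0, step_scale=2.0, x0=0.0):
    """Estimate the center of symmetry by clipped SGD on f(x) = E[(x - X)^2] / 2.

    Hypothetical helper for illustration; parameters are not taken from the paper.
    """
    x = x0
    for t, sample in enumerate(samples, start=1):
        grad = x - sample                               # stochastic gradient of f at x
        grad = max(-clip_level, min(clip_level, grad))  # clip to [-clip_level, clip_level]
        x -= (step_scale / t) * grad                    # decaying step size step_scale / t
    return x


if __name__ == "__main__":
    estimate = clipped_sgd_mean(cauchy_samples(center=3.0, n=20000))
    print(f"clipped-SGD estimate of the center: {estimate:.3f}")
```

Because the noise is symmetric, the clipped gradient has zero expectation exactly at the center of symmetry, so the iterate settles there even though the sample mean of Cauchy data does not converge. A UCB-type rule would additionally need a confidence bonus around such an estimate, which this sketch does not attempt.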

Fast UCB-type algorithms for stochastic bandits with heavy and super heavy symmetric noise