This work addresses the problem of regret minimization in non-stochastic multi-armed bandit problems, focusing on performance guarantees that hold with high probability. Such results are rather scarce in the literature since proving them requires a great deal of technical effort and significant modifications to the standard, more intuitive algorithms that come only with guarantees that hold in expectation. One of these modifications is forcing the learner to sample the losses of every arm at least $\Omega(\sqrt{T})$ times over $T$ rounds, which can adversely affect performance if many of the arms are obviously suboptimal. While it is widely conjectured that this property is essential for proving high-probability regret bounds, we show in this paper that it is possible to achieve such strong results without this undesirable exploration component. Our result relies on a simple and intuitive loss-estimation strategy called \emph{Implicit eXploration} (IX) that allows a remarkably clean analysis. To demonstrate the flexibility of our technique, we derive several improved high-probability bounds for various extensions of the standard multi-armed bandit framework. Finally, we conduct a simple experiment that illustrates the robustness of our implicit exploration technique.
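
The abstract names the IX loss-estimation strategy without spelling it out. As a rough illustration, the sketch below assumes the commonly stated form of the estimator, in which the observed loss of the chosen arm is divided by its sampling probability plus a small constant `gamma`, plugged into an exponential-weights learner. The function names, the `loss_fn` interface, and the tuning of `eta` and `gamma` are assumptions made for this example and are not taken from the text above.

```python
import math
import random


def ix_loss_estimate(observed_loss, prob_of_chosen_arm, gamma):
    """Assumed IX estimate for the chosen arm: loss / (p + gamma).

    With gamma = 0 this reduces to ordinary importance weighting.
    A positive gamma biases the estimate downward and caps it at loss / gamma.
    """
    return observed_loss / (prob_of_chosen_arm + gamma)


def exp3_ix(loss_fn, num_arms, horizon, eta, gamma, rng):
    """Exponential weights with IX loss estimates (illustrative sketch).

    loss_fn(t, arm) is a hypothetical callback returning a loss in [0, 1].
    Note that no uniform exploration is mixed into the sampling distribution.
    """
    cum_est = [0.0] * num_arms  # cumulative estimated loss per arm
    total_loss = 0.0
    for t in range(horizon):
        # Subtract the minimum before exponentiating for numerical stability.
        base = min(cum_est)
        weights = [math.exp(-eta * (c - base)) for c in cum_est]
        norm = sum(weights)
        probs = [w / norm for w in weights]
        arm = rng.choices(range(num_arms), weights=probs)[0]
        loss = loss_fn(t, arm)
        total_loss += loss
        # Only the chosen arm gets a nonzero estimate in this round.
        cum_est[arm] += ix_loss_estimate(loss, probs[arm], gamma)
    return total_loss, cum_est


if __name__ == "__main__":
    K, T = 3, 2000
    eta = math.sqrt(2 * math.log(K) / (K * T))  # assumed tuning
    gamma = eta / 2                             # assumed tuning
    total, _ = exp3_ix(lambda t, a: 0.1 if a == 0 else 0.9,
                       K, T, eta, gamma, random.Random(0))
    print(f"total loss over {T} rounds: {total:.1f}")
```

In this sketch the only change relative to a plain importance-weighted learner is the `+ gamma` in the denominator, which is where the "implicit" exploration comes from: the sampling distribution itself is left untouched.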

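This paper proposes a loss-estimation strategy based on Implicit eXploration that achieves high-probability regret bounds without an unnecessary exploration component, yielding improved results for multi-armed bandit problems.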

Explore no more: Improved high-probability regret bounds for non-stochastic bandits