We consider multiclass classification in the adversarial online setting. What is the price of relying on bandit feedback as opposed to full information? To what extent can an adaptive adversary amplify the loss compared to an oblivious one? To what extent can a randomized learner reduce the loss compared to a deterministic one? We study these questions in the mistake bound model and provide nearly tight answers. We demonstrate that the optimal mistake bound under bandit feedback is at most $O(k)$ times higher than the optimal mistake bound in the full information case, where $k$ is the number of labels. This bound is tight and answers an open question previously posed and studied by Daniely and Helbertal ['13] and by Long ['17, '20], who focused on deterministic learners. Moreover, we present nearly optimal bounds of $\tilde{\Theta}(k)$ on the gap between randomized and deterministic learners, as well as between adaptive and oblivious adversaries, in the bandit feedback setting. This stands in contrast to the full information scenario, where adaptive and oblivious adversaries are equivalent and the gap in mistake bounds between randomized and deterministic learners is a constant multiplicative factor of $2$. In addition, our results imply that in some cases the optimal randomized mistake bound is approximately the square root of its deterministic counterpart. Previous results show that this is essentially the smallest it can be.
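
To make the two feedback models concrete, the following toy sketch may help. It is not a construction from this work: it assumes a single instance that is presented repeatedly, a fixed unknown label among $k$ candidates, and a hypothetical deterministic learner that always predicts its smallest remaining candidate. The name `count_mistakes` is introduced here for illustration only. Under full information the true label is revealed after each round, while under bandit feedback the learner only learns whether its prediction was correct.

```python
def count_mistakes(k: int, true_label: int, bandit: bool) -> int:
    """Count mistakes of a simple deterministic learner on one repeated instance.

    Assumptions (illustrative only): labels are 0..k-1, the label is fixed,
    and the learner always predicts its smallest remaining candidate.
    """
    candidates = list(range(k))
    mistakes = 0
    while True:
        guess = candidates[0]
        if guess == true_label:
            # Correct prediction: the learner has identified the label.
            return mistakes
        mistakes += 1
        if bandit:
            # Bandit feedback: only "your guess was wrong" is revealed,
            # so exactly one candidate can be eliminated.
            candidates.remove(guess)
        else:
            # Full information: the true label itself is revealed.
            candidates = [true_label]


if __name__ == "__main__":
    k = 5
    worst_bandit = max(count_mistakes(k, y, bandit=True) for y in range(k))
    worst_full = max(count_mistakes(k, y, bandit=False) for y in range(k))
    print(worst_bandit, worst_full)
```

In this toy case the worst-case number of mistakes is $k-1$ under bandit feedback and $1$ under full information, which gives some intuition for why a multiplicative factor on the order of $k$ can separate the two settings for deterministic learners.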

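For multiclass classification in the adversarial online setting, we study how relying on bandit feedback rather than full information affects the optimal mistake bound, and we provide nearly tight answers. We also study the gaps between randomized and deterministic learners, and between adaptive and oblivious adversaries, under bandit feedback, and contrast them with the full information scenario. In addition, our results show that in some cases the optimal randomized mistake bound is close to the square root of its deterministic counterpart.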

Bandit Feedback Algorithms for Online Multiclass Classification: Variants and Tradeoffs