Online learning algorithms are designed to learn even when their input is generated by an adversary. The widely accepted formal definition of an online algorithm's ability to learn is the game-theoretic notion of regret. We argue that the standard definition of regret becomes inadequate if the adversary is allowed to adapt to the online algorithm's actions. We define the alternative notion of policy regret, which attempts to provide a more meaningful way to measure an online algorithm's performance against adaptive adversaries. Focusing on the online bandit setting, we show that no bandit algorithm can guarantee sublinear policy regret against an adaptive adversary with unbounded memory. On the other hand, if the adversary's memory is bounded, we present a general technique that converts any bandit algorithm with a sublinear regret bound into an algorithm with a sublinear policy regret bound. We extend this result to other variants of regret, such as switching regret, internal regret, and swap regret.
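
To make the contrast between the two notions concrete, the following is one way they can be written down. The notation is an assumption introduced here for illustration and is not fixed by the abstract: x_t denotes the action played at round t, f_t the adversary's loss function at round t (which may depend on all actions played so far), and T the number of rounds.

```latex
% Assumed notation: x_t = action at round t, f_t = loss function at round t,
% T = horizon. The minimum ranges over single fixed actions x.
\begin{align*}
\mathrm{Regret}_T
  &= \sum_{t=1}^{T} f_t(x_1,\dots,x_{t-1},x_t)
   \;-\; \min_{x} \sum_{t=1}^{T} f_t(x_1,\dots,x_{t-1},x), \\
\mathrm{PolicyRegret}_T
  &= \sum_{t=1}^{T} f_t(x_1,\dots,x_{t-1},x_t)
   \;-\; \min_{x} \sum_{t=1}^{T} f_t(x,\dots,x,x).
\end{align*}
```

In the first expression the comparator changes only the current action and keeps the past actions exactly as the algorithm played them, so the adversary's reaction to a different history is never accounted for. In the second, the comparator is evaluated on the losses the adversary would have produced had the fixed action been played in every round. Under this notation, a bounded-memory adversary with memory m would be one whose f_t depends only on the last m+1 actions.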

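The paper argues that the standard definition of regret is no longer adequate when the adversary can adapt to the online algorithm's actions. It defines an alternative notion, policy regret, for measuring an online algorithm's performance against adaptive adversaries. For the online bandit setting, it shows that no bandit algorithm can guarantee sublinear policy regret against an adaptive adversary with unbounded memory. When the adversary's memory is bounded, it presents a general technique that converts any bandit algorithm with a sublinear regret bound into an algorithm with a sublinear policy regret bound, and it extends this result to other variants of regret.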
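
The text above does not spell out how the conversion works. One plausible reading is a mini-batching wrapper: the base bandit algorithm picks an action once per batch, that action is repeated for the whole batch, and the average batch loss is fed back as a single loss. The sketch below illustrates that idea only. The names `Exp3`, `run_minibatched`, and `switching_adversary`, the learning rate, and the toy adversary are all assumptions made for illustration and are not taken from the source.

```python
import math
import random


class Exp3:
    """Minimal Exp3-style bandit learner over k actions, losses in [0, 1]."""

    def __init__(self, k, eta, rng):
        self.k = k
        self.eta = eta  # learning rate (illustrative value chosen by caller)
        self.rng = rng
        self.log_weights = [0.0] * k
        self.probs = [1.0 / k] * k

    def choose(self):
        # Softmax over log-weights, shifted by the max for numerical stability.
        top = max(self.log_weights)
        w = [math.exp(lw - top) for lw in self.log_weights]
        total = sum(w)
        self.probs = [x / total for x in w]
        return self.rng.choices(range(self.k), weights=self.probs)[0]

    def update(self, action, loss):
        # Importance-weighted loss estimate for the action that was played.
        self.log_weights[action] -= self.eta * loss / self.probs[action]


def run_minibatched(base, adversary_loss, horizon, batch_size):
    """Repeat each action chosen by `base` for a whole batch and feed back
    the average loss of that batch as one round of feedback."""
    history = []
    total_loss = 0.0
    while len(history) < horizon:
        action = base.choose()
        batch_losses = []
        for _ in range(min(batch_size, horizon - len(history))):
            history.append(action)
            loss = adversary_loss(history)  # may depend on past actions
            batch_losses.append(loss)
            total_loss += loss
        base.update(action, sum(batch_losses) / len(batch_losses))
    return history, total_loss


def switching_adversary(history):
    """Hypothetical memory-1 adversary: loss 1 on any switch, otherwise
    0.2 for action 0 and 0.8 for action 1."""
    current = history[-1]
    if len(history) >= 2 and history[-2] != current:
        return 1.0
    return 0.2 if current == 0 else 0.8


if __name__ == "__main__":
    learner = Exp3(k=2, eta=0.1, rng=random.Random(0))
    actions, loss = run_minibatched(learner, switching_adversary, 1000, 20)
    print("average loss per round:", loss / len(actions))
```

The intuition behind this design is that, with a batch much longer than the adversary's memory, most rounds in a batch see a history consisting of the same repeated action, so the averaged batch loss approximates the loss that action would incur if played constantly. The batch size trades this approximation error against the number of updates the base algorithm receives.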

Online Bandit Learning against an Adaptive Adversary: from Regret to Policy Regret