We study a security threat to adversarial multi-armed bandits, in which an attacker perturbs the loss or reward signal to control the behavior of the victim bandit player. We show that the attacker can mislead any no-regret adversarial bandit algorithm into selecting a suboptimal target arm in all but a sublinear number of rounds (that is, in T - o(T) rounds), while incurring only sublinear (o(T)) cumulative attack cost. This result raises critical security concerns for real-world bandit-based systems: in online recommendation, for example, an attacker might be able to hijack the recommender system and promote a desired product. Our proposed attack algorithms require knowledge of only the victim's regret rate, and are thus agnostic to the concrete bandit algorithm employed by the victim player. We also derive a theoretical lower bound on the cumulative attack cost that any victim-agnostic attack algorithm must incur. The lower bound matches the upper bound achieved by our attack, which shows that our attack is asymptotically optimal.
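To make the threat model concrete, the following is a minimal simulation sketch and not the attack algorithm proposed in this work. It assumes an Exp3 victim, constant true losses in [0, 1], and a naive attacker that raises the observed loss of every non-target arm to 1. The function name `simulate_attack`, the loss values, the learning rate, and the cost measure (absolute change in the observed loss) are all illustrative assumptions. Unlike the attacks described above, this sketch does not use the victim's regret rate.

```python
import math
import random


def simulate_attack(T=5000, K=3, target=2, seed=0):
    """Run an Exp3 victim against a naive loss-poisoning attacker (illustrative)."""
    rng = random.Random(seed)
    eta = math.sqrt(2.0 * math.log(K) / (T * K))  # a common Exp3 learning rate
    # Assumed environment: the target arm is truly the worst arm.
    true_loss = [0.2] * K
    true_loss[target] = 0.6
    cum_est = [0.0] * K  # importance-weighted cumulative loss estimates
    target_pulls = 0
    attack_cost = 0.0
    for _ in range(T):
        # Exp3: sample an arm with probability proportional to exp(-eta * estimate).
        base = min(cum_est)  # subtracting the minimum keeps the exponentials in (0, 1]
        weights = [math.exp(-eta * (c - base)) for c in cum_est]
        total = sum(weights)
        probs = [w / total for w in weights]
        arm = rng.choices(range(K), weights=probs)[0]
        # Attacker: inflate the loss of any non-target arm to the maximum value 1.
        observed = true_loss[arm] if arm == target else 1.0
        attack_cost += abs(observed - true_loss[arm])
        target_pulls += int(arm == target)
        # Victim updates using only the (possibly corrupted) observed loss.
        cum_est[arm] += observed / probs[arm]
    return target_pulls, attack_cost
```

Calling `simulate_attack()` returns the number of rounds in which the target arm was pulled and the total attack cost. In this sketch the attacker pays only in rounds where the victim pulls a non-target arm, so a victim that has low regret on the corrupted losses also keeps the attack cost low.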

Adversarial Attacks on Adversarial Bandits