We study a strategic variant of the multi-armed bandit problem, which we call
the strategic click-bandit. This model is motivated by applications in online
recommendation where the choice of recommended items depends on both the
click-through rates and the post-click rewards. As in classical bandits,
rewards follow a fixed unknown distribution. However, we assume that the
click-rate of each arm is chosen strategically by the arm (e.g., a host on
Airbnb) in order to maximize the number of times it gets clicked. The algorithm
designer knows neither the post-click rewards nor the arms' actions (i.e.,
strategically chosen click-rates) in advance, and must learn both values over
time. To solve this problem, we design an incentive-aware learning algorithm,
UCB-S, which achieves two goals simultaneously: (a) incentivizing desirable arm
behavior under uncertainty; (b) minimizing regret by learning unknown
parameters. We characterize all approximate Nash equilibria among arms under
UCB-S and show a $\tilde{\mathcal{O}} (\sqrt{KT})$ regret bound uniformly in
every equilibrium. We also show that incentive-unaware algorithms generally
fail to achieve low regret in the strategic click-bandit. Finally, we support
our theoretical results with simulations of strategic arm behavior, which confirm
the effectiveness and robustness of our proposed incentive design.

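We study a strategic variant of the multi-armed bandit problem, called the strategic click-bandit. We design an incentive-aware learning algorithm, UCB-S, which incentivizes desirable arm behavior under uncertainty and learns the unknown parameters in order to minimize regret. Our theoretical results are supported by simulations of strategic arm behavior, which confirm the effectiveness and robustness of our proposed incentive design.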
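
The interaction protocol described above can be illustrated with a minimal sketch. It is not UCB-S: the recommendation policy here is a uniformly random placeholder, the post-click rewards are assumed to be Bernoulli, and the names `simulate`, `click_rates`, and `reward_means` are hypothetical. The sketch only shows what the learner observes (clicks, and rewards after a click) and what each arm cares about (its number of clicks).

```python
import random


def simulate(click_rates, reward_means, horizon, seed=0):
    """Simulate the click-then-reward protocol under a placeholder policy.

    click_rates[i]  : click-rate chosen by arm i (its strategic action)
    reward_means[i] : mean post-click reward of arm i (assumed Bernoulli here)
    Both are unknown to the learner; they are used only to generate feedback.
    """
    rng = random.Random(seed)
    k = len(click_rates)
    pulls = [0] * k          # times each arm was recommended
    clicks = [0] * k         # arm i's utility: number of clicks received
    reward_sums = [0.0] * k  # total post-click reward collected from arm i

    for _ in range(horizon):
        arm = rng.randrange(k)  # placeholder policy, NOT the UCB-S rule
        pulls[arm] += 1
        # A click occurs with the arm's chosen click-rate.
        if rng.random() < click_rates[arm]:
            clicks[arm] += 1
            # A reward is observed only after a click.
            if rng.random() < reward_means[arm]:
                reward_sums[arm] += 1.0
    return pulls, clicks, reward_sums


def estimates(pulls, clicks, reward_sums):
    """Empirical click-rates and post-click reward means (None if no data)."""
    click_hat = [c / n if n else None for c, n in zip(clicks, pulls)]
    reward_hat = [r / c if c else None for r, c in zip(reward_sums, clicks)]
    return click_hat, reward_hat
```

Calling `estimates(*simulate([0.9, 0.4], [0.2, 0.8], 10_000))` returns the empirical click-rates and post-click reward means, the two quantities the learner has to estimate over time. In this example the first arm attracts more clicks while the second delivers more reward per click, which is the tension the incentive design is meant to address.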