We study the contextual linear bandit problem, a version of the standard stochastic multi-armed bandit (MAB) problem in which a learner sequentially selects actions to maximize a reward that also depends on a user-provided per-round context. Though the context is chosen arbitrarily or adversarially, the reward is assumed to be a stochastic function of a feature vector that encodes the context and the selected action. Our goal is to devise private learners for the contextual linear bandit problem. We first show that using the standard definition of differential privacy results in linear regret. Instead, we adopt the notion of joint differential privacy, where we assume that the action chosen on day $t$ is revealed only to user $t$ and thus need not be kept private on that day, only on the following days. We give a general scheme that converts the classic linear-UCB algorithm into a joint differentially private algorithm using the tree-based algorithm. We then apply either Gaussian noise or Wishart noise to obtain joint differentially private algorithms, and we bound the regret of the resulting algorithms. In addition, we give the first lower bound on the additional regret that any private algorithm for the MAB problem must incur.
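
To make the tree-based algorithm mentioned above concrete, here is a minimal sketch of the generic binary-tree mechanism for releasing noisy running sums. It is an illustration under assumptions, not the paper's construction: the function name `tree_prefix_sums`, the scalar input stream, and the noise scale `sigma` are hypothetical, and calibrating `sigma` to a privacy budget is deliberately left out. Each prefix sum is assembled from at most about $\log_2 T$ cached noisy block sums, so every input contributes to only logarithmically many noisy values.

```python
import random
from itertools import accumulate


def tree_prefix_sums(xs, sigma, seed=None):
    """Noisy prefix sums of xs via the binary-tree mechanism (sketch).

    sigma is an assumed per-node Gaussian noise scale; choosing it to meet
    a given privacy guarantee is outside the scope of this sketch.
    """
    rng = random.Random(seed)
    # exact[t] = xs[0] + ... + xs[t-1]; only entries up to t are read at step t,
    # so the same logic could be run online.
    exact = [0.0] + list(accumulate(float(x) for x in xs))
    noisy_node = {}  # (start, length) -> noisy sum over xs[start:start+length]
    out = []
    for t in range(1, len(xs) + 1):
        total, start, remaining = 0.0, 0, t
        # Split [0, t) into dyadic blocks following the binary expansion of t.
        while remaining > 0:
            length = 1 << (remaining.bit_length() - 1)  # largest power of 2 <= remaining
            key = (start, length)
            if key not in noisy_node:
                true_sum = exact[start + length] - exact[start]
                # Noise is drawn once per block and reused in later prefix sums.
                noisy_node[key] = true_sum + rng.gauss(0.0, sigma)
            total += noisy_node[key]
            start += length
            remaining -= length
        out.append(total)
    return out


if __name__ == "__main__":
    print(tree_prefix_sums([1, 2, 3, 4, 5], sigma=1.0, seed=0))
```

Setting `sigma=0.0` recovers the exact running sums, which is a quick way to check the block decomposition. In a linear-UCB setting the summands would presumably be matrix- or vector-valued statistics rather than scalars, which this sketch does not attempt to model.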

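This paper studies private learning algorithms for the contextual linear bandit problem. Adopting the definition of joint differential privacy, it converts the classic linear-UCB algorithm into a joint differentially private algorithm that uses either Gaussian noise or Wishart noise, and it bounds the regret of the resulting algorithms. It also gives the first lower bound on the additional regret that any private algorithm for the MAB problem must incur.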

Differentially Private Contextual Linear Bandits